Abstract



Calculating aggregation operators of moving point objects, using time as a continuous variable,
presents unique problems when querying for congestion in a moving and changing (or dynamic) query
space. We present a set of congestion query operators, based on a threshold value, that estimate the
following five aggregation operations in d dimensions. 1) We call the count of point objects that intersect
the dynamic query space during the query time interval the CountRange. 2) We call the maximum
(or minimum) congestion in the dynamic query space at any time during the query time interval the
MaxCount (or MinCount). 3) We call the total time that the dynamic query space is congested
the ThresholdSum. 4) We call the number of times that the dynamic query space is congested the
ThresholdCount. 5) We call the average length of all the time intervals during which the dynamic
query space is congested the ThresholdAverage. These operators rely on a novel approach to
transforming the problem of selection based on position into a problem of selection based on a threshold.
These operators can be used to predict concentrations of migrating birds that may carry diseases such as
Bird Flu, and hence the information may be used to predict high-risk areas. On a smaller scale, those
operators are also applicable to maintaining safety in airplane operations. We present the theory of our
estimation operators and provide algorithms for exact operators. The implementations of those operators,
and experiments, which include data from more than 7500 queries, indicate that our estimation
operators produce fast, efficient results with error under 5%.

1 Introduction 

Safety can often be reduced to a problem of congestion. The safety of flight depends on separation of
airplanes or, more generally, on the maximum number of airplanes that a particular airspace can safely contain
and the maximum number of airplanes that air traffic controllers (ATC) responsible for directing airplanes
can safely track. When considering epidemics, the presence of a single animal with Bird Flu does not
indicate the start of an epidemic. Instead, the presence of a certain number of instances of the
disease indicates a high risk of starting an epidemic, or actual epidemic conditions. Consequently, we see
that congestion often links to safety and can predict high-risk or even dangerous conditions.

Congestion is defined differently depending on the application. Hence it is necessary to provide aggregation
operators that take a threshold value as a parameter to define congestion.

In relational databases, Max, Min, Count, Sum and Average form the set of natural aggregation
operators. Spatiotemporal databases containing moving objects, based on continuous time, cannot apply
these operators in the same way. However, these operators may still function in interesting ways for moving
objects. For example, one can ask how many moving point objects exist within a moving and changing (or
dynamic) rectangular area at a certain time, or what is the maximum distance between two moving points
at certain times. Obviously, when we are interested only in discrete time instances, the moving point object
database can be reduced to a relational database and the above queries can be expressed as simple Count
or Max queries.

Moving object databases naturally suggest new aggregate operators that have no equivalents in relational
databases. For example, one may ask what is the maximum number of moving-point objects that
exist simultaneously within a dynamic rectangular area at any time during a time interval T? We call this
the MaxCount query (symmetrically we can also find the MinCount). One may also ask during what
time intervals in T do there exist more than M moving objects within a rectangular area? We call this
the ThresholdRange. We show that a strong relationship exists between MaxCount and ThresholdRange,
and we show that ThresholdRange forms the basis for a family of threshold operators that
include: ThresholdCount, ThresholdSum, and ThresholdAverage. A related, though less complex,
operator answers the question: what is the number of moving objects that exist within or intersect a dynamic
rectangular area at any time instance during interval T? We call this type of query the CountRange query.
We give the following definitions for aggregation operators:

Definition 1 (Dynamic Query Space) Dynamic query space is defined by a continuous time interval T , 
and a d-dimensional space that may move and change size or shape over the query time interval. 

Throughout this paper we consider the shape of the query space to be a box or cube. 

Definition 2 (MaxCount (MinCount)) Let S be a set of moving points. Given a dynamic query space
R defined by two moving points $Q_1$ and $Q_2$ as the lower-left and upper-right corners of R, and a time interval
T, the MaxCount (MinCount) operator finds the time $t_{\max}$ ($t_{\min}$) and the maximum (or minimum) number of
points $M_{\max}$ ($M_{\min}$) in S that R can contain at any time instance within T.

Throughout this paper we develop the MaxCount operator because wherever we find a maximum, a
minimum can be found similarly.

Definition 3 (ThresholdRange) Let S be a set of moving points. Given a dynamic query space R defined
by two moving points $Q_1$ and $Q_2$ as the lower-left and upper-right corners of R, a time interval T, and a
threshold value M, the ThresholdRange operator finds the set of time intervals $T_M$ where the count of
objects in R is larger than M.

ThresholdRange is directly related to MaxCount in that when M is raised to $M_{\max}$, then ThresholdRange
returns a time interval containing $t_{\max}$ and during this time interval, the count will be $M_{\max}$.

Definition 4 (ThresholdCount) Given a ThresholdRange, ThresholdCount returns the number 
of time intervals. 

Definition 5 (ThresholdSum) Given a ThresholdRange, ThresholdSum returns the total time $T_s$
during which the count is above M. That is, over all $T_i \in T_M$, ThresholdSum returns:

$$T_s = \sum_{T_i \in T_M} |T_i| \qquad (1)$$

where $|T_i|$ means the length of the interval.

Definition 6 (ThresholdAverage) Given a ThresholdRange, ThresholdAverage returns the average
length of the intervals in $T_M$.

In addition to the threshold aggregation operators, we also use our bucketing method to implement the 
CountRange defined as follows. 

Definition 7 (CountRange) Let S be a set of moving points. Given a dynamic query space R defined
by two moving points $Q_1$ and $Q_2$ as the lower-left and upper-right corners of R and a time interval T, the
CountRange query returns the total number of points that intersect R in T.

Together MaxCount (MinCount) and the threshold operators form a complete set of threshold
aggregation operators comparable to the aggregation operators given in relational databases.
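
To make Definitions 2 to 7 concrete, the following Python sketch evaluates the operators exactly for a small set of linearly moving points. It is an illustration under stated assumptions, not the algorithm developed later in this paper: each point and each query corner is assumed to be a list of (velocity, start position) pairs, one per dimension, and all function names are hypothetical. Because every containment constraint is linear in t, each point is inside the box during at most one sub-interval of T, so the operators reduce to a sweep over entry and exit events.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]
Motion = List[Tuple[float, float]]  # one (velocity, start position) pair per dimension


def inside_interval(p: Motion, q1: Motion, q2: Motion,
                    t0: float, t1: float) -> Optional[Interval]:
    """Sub-interval of [t0, t1] during which p lies inside the box q1..q2."""
    lo, hi = t0, t1
    for (vp, sp), (v1, s1), (v2, s2) in zip(p, q1, q2):
        # p >= q1 and q2 >= p are each of the form a*t + b >= 0.
        for a, b in ((vp - v1, sp - s1), (v2 - vp, s2 - sp)):
            if a == 0:
                if b < 0:
                    return None  # constraint is never satisfied
            elif a > 0:
                lo = max(lo, -b / a)
            else:
                hi = min(hi, -b / a)
    return (lo, hi) if lo <= hi else None


def _events(intervals: List[Interval]):
    # Entries (kind 0) sort before exits (kind 1) at equal times.
    return sorted([(lo, 0, 1) for lo, _ in intervals] +
                  [(hi, 1, -1) for _, hi in intervals])


def max_count(intervals: List[Interval]):
    """Return (t_max, M_max): the first time the count peaks, and the peak."""
    best, best_t, count = 0, None, 0
    for t, _, delta in _events(intervals):
        count += delta
        if count > best:
            best, best_t = count, t
    return best_t, best


def threshold_range(intervals: List[Interval], m: int) -> List[Interval]:
    """Time intervals during which more than m points are inside the box."""
    result, count, start = [], 0, None
    for t, _, delta in _events(intervals):
        count += delta
        if count > m and start is None:
            start = t
        elif count <= m and start is not None:
            result.append((start, t))
            start = None
    return result


def threshold_count(tm): return len(tm)
def threshold_sum(tm): return sum(hi - lo for lo, hi in tm)
def threshold_average(tm): return threshold_sum(tm) / len(tm) if tm else 0.0


# One-dimensional illustration: static box [0, 10], query interval T = [0, 10].
q1, q2 = [(0.0, 0.0)], [(0.0, 10.0)]
points = [[(1.0, -2.0)], [(-1.0, 14.0)], [(2.0, 0.0)], [(0.0, 20.0)]]
intervals = [iv for p in points
             if (iv := inside_interval(p, q1, q2, 0.0, 10.0)) is not None]
count_range = len(intervals)          # CountRange
t_max, m_max = max_count(intervals)   # MaxCount
t_m = threshold_range(intervals, 2)   # ThresholdRange for M = 2
```

In this toy input the fourth point never enters the box, so CountRange is 3, and the three remaining points overlap only between t = 4 and t = 5. Filtering all points and then sorting only the intersecting ones matches the shape of the cost quoted for the exact algorithm in Section 2, although the sketch makes no claim to reproduce that algorithm.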

The following examples use the simple concepts of flying to demonstrate the use of a few of these threshold 
aggregation operators. 

Example 8 Airplanes are commonly modeled as linearly moving objects with preestablished flight plans.
Suppose, at any time, at most a constant number M of airplanes is allowed to be in the O'Hare airspace to
avoid congestion. Suppose also a new airplane requests approval of its flight plan for entering the O'Hare
airspace between times $t_a$ and $t_b$. The air traffic controllers can avoid congestion as follows. If after adding
a new flight plan, the MaxCount between $t_a$ and $t_b$ is still less than M, then they can approve the flight.
Otherwise, they need to find some alternative path, and check it again against the database.

Air traffic controllers try to direct airplanes as linearly moving objects for fuel efficiency, among other 
reasons. If they recognize a developing congestion too late, then they often must direct the airplane to fly in 
circles until the congestion has cleared. That solution wastes fuel. On the other hand, if they recognize the 
developing congestion early, then they can often simply tell the airplane to change its speed, which saves fuel. 
Therefore, it is important to identify congestion as early as possible. We may identify congestion by using
a MaxCount query where a moving box around the airplane and a time interval $[t_a, t_b]$ define the query. If
the MaxCount predicts congestion, then the airplane's speed can be adjusted early in the flight.

Example 9 Suppose we want to alert pilots if their current flight path takes them through at least one 
congested region. 

The Traffic Alert/Collision Avoidance System (TCAS) provides similar functionality. TCAS
only provides alerts for current congestion, not predicted congestion. Although TCAS was implemented in
1986, we continue to have mid-air collisions and near misses, indicating that the system still needs improvement.
ThresholdRange is a modification of MaxCount that returns all predicted time intervals on the
flight path where the count exceeds a given threshold. Hence using ThresholdRange we can alert a pilot
of predicted congestion where more than M other airplanes will be within the space B around the airplane.
Predicting and avoiding these areas can significantly reduce the chances of mid-air collisions.

Example 10 Suppose we are especially concerned about a rush-hour period $[t_a, t_b]$ that is particularly stressful
to air traffic controllers. Suppose controllers can direct at most M airplanes safely. We can determine the
number of controllers needed during the rush-hour time by executing the CountRange query over the controlled
airspace during the rush hour and dividing by M. By ensuring that a sufficient number of controllers
are present, safety is achieved and controllers are not overstressed.

Each of the operators can also be applied to examine different aspects of congestion with regard to bird 
migration and hence disease control. These questions and examples, motivated by research on MaxCount, 
led us to explore complex threshold aggregations and data structures to support them. 

The rest of this paper is organized as follows. Section 2 gives some background on the concepts of point
domination and sweeping techniques, and then introduces the data structures used to build buckets. These buckets
can then be used in various indexing algorithms to fit the type of application used. Section 3 develops
the MaxCount estimation algorithm using a running example. Section 4 develops the ThresholdRange
algorithm based on MaxCount and demonstrates the relationship that ties MaxCount to the remaining
threshold operators. This section also develops algorithms for each of those operators including CountRange.
Section 5 gives the experimental results of the implementation. Section 6 reviews the related work
and Section 7 gives conclusions and future work.



2 Hyper-Bucket Data Structures 



This section presents an updatable skew-aware bucket for indices that models the skewed point distributions
in each bucket. The skew-aware technique allows the index structure to perform inserts, deletes, and updates
in constant time using a HashTable to store the buckets. Many spatiotemporal applications, such
as tracking clients on a wireless network, particularly need these fast updates, and no other MaxCount
algorithm presented prior to this can meet that requirement. Because the buckets are spatially defined, the bucketing
technique also easily adapts to other spatial and spatiotemporal indices such as the R-tree (Guttman, 1984).
Hence, by using an appropriate index, the technique performs well whether search operations or update
operations occur more frequently.

Our algorithm uses a sweeping method to evaluate the threshold aggregation operators similar to previous
approaches from Chen & Revesz (2004), Revesz & Chen (2003) and Anderson (2006). The algorithm differs
in that the sweeping algorithm integrates a skew-aware density function over the spatial dimensions of the
bucket to obtain the time-dependent count function. The density function in the bucket increases accuracy
over methods given in (Chen & Revesz, 2004; Anderson, 2006) while maintaining the same number of buckets.
This idea is a crucial improvement because we model the point distribution skew in a bucket, whereas previous
methods adapted to skew by increasing the number of buckets or changing their shape and contents. We also
present a precise algorithm for evaluating the threshold aggregation operators that requires no index and runs
in O(N) + O(n log n) time and O(n) space, where N is the number of points in the database and n is the value
of a CountRange query using the same query space and time. Both the threshold aggregation algorithms
and the skew-aware bucket data structure presented are implemented and analyzed in 3-dimensional space.
We show that the approximation achieves good results while significantly reducing the running times.

Section 2.1 describes the problems related to creating hyper-buckets (also referred to as just buckets) and
a specific solution for creating 6-dimensional buckets for 3-dimensional linearly moving points. In all cases, we
can extend our method to d dimensions. Section 2.2 describes the method for inserting and deleting a point
from a bucket and shows that updates take constant time. Section 2.3 applies two different data structures to
contain the buckets, suited for applications where either inserts and deletes or threshold aggregation queries
dominate.

2.1 Hyper-Bucket Data Structure 

Definition 11 (Hex Representation) Define each 3-dimensional linearly moving point p by parametric
linear equations in t as follows:

$$\begin{cases} p_x = v_x t + x_0 \\ p_y = v_y t + y_0 \\ p_z = v_z t + z_0 \end{cases} \qquad (2)$$

where the corresponding hex representation of p is the tuple $(v_x, x_0, v_y, y_0, v_z, z_0)$ containing the duals of $p_x$,
$p_y$, and $p_z$. For simplicity we often denote the six-tuple as $(x_1, \ldots, x_6)$.

Consider a relation $D(x_1, \ldots, x_6)$ that contains the hex representation of linearly moving points in 3 dimensions.
Then D represents a 6-dimensional static space. Divide the space into axis-aligned hyper-rectangles
where the k-th axis has $d_k$ divisions. Each hyper-rectangle becomes a bucket containing moving points whose
hex representation falls inside the hyper-rectangle.

Definition 12 (Hyper-bucket dimensions) Define the dimensions of each bucket $B_i$ by inequalities of
the form:

$$\begin{aligned} v_{x,L} \le v_x < v_{x,U} &\;\wedge\; x_{0,L} \le x_0 < x_{0,U} \;\wedge \\ v_{y,L} \le v_y < v_{y,U} &\;\wedge\; y_{0,L} \le y_0 < y_{0,U} \;\wedge \\ v_{z,L} \le v_z < v_{z,U} &\;\wedge\; z_{0,L} \le z_0 < z_{0,U} \end{aligned} \qquad (3)$$

where we denote the lower bound as:

$$(v_{x,L}, x_{0,L}, v_{y,L}, y_{0,L}, v_{z,L}, z_{0,L}) \qquad (4)$$

and the upper bound as

$$(v_{x,U}, x_{0,U}, v_{y,U}, y_{0,U}, v_{z,U}, z_{0,U}). \qquad (5)$$

Each hyper-rectangle defines the spatial dimensions of a possible bucket, where only buckets that contain
points need be included in the index. The maximum number of possible buckets is given by $m = \prod_k d_k$.

Definition 13 (Histograms) Given a 6-dimensional rectangle $B_i$, given by Definition 12, containing $b_i$
points, build the histograms $h_{i,1}, \ldots, h_{i,6}$ for each axis using s subdivisions as follows. To create histogram
$h_{i,j}$, divide bucket $B_i$ into s parallel subdivisions along the j-th axis, and record separately the number of
points within $B_i$ that fall within each subdivision.

Example 14 (Building Histograms) Consider a set of 6-dimensional points projected onto the $(v_x, x_0)$
plane as shown in Figure 1. Assume that the number of subdivisions is s = 10 along both $v_x$ and $x_0$.
Figure 2 shows $h_{i,1}$ and $h_{i,2}$. For example, the subdivision $0 \le v_x < 1$ contains six points and hence the first
bar of histogram $h_{i,1}$ rises to level 6. The other values can be determined similarly.
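
The histogram construction of Definition 13 can be sketched in a few lines of Python. This is a minimal illustration, assuming a bucket given by hypothetical `lower` and `upper` corner tuples and points stored as plain tuples; the function name `build_histograms` is ours and not part of the paper.

```python
import math


def build_histograms(points, lower, upper, s):
    """Return one histogram (a list of s counts) per axis of the bucket.

    points: tuples lying inside the bucket, with lower[j] <= p[j] < upper[j].
    s:      number of equal-width subdivisions along every axis.
    """
    dims = len(lower)
    hist = [[0] * s for _ in range(dims)]
    for p in points:
        for j in range(dims):
            width = (upper[j] - lower[j]) / s
            # Clamp so that rounding at the upper edge cannot overflow.
            k = min(math.floor((p[j] - lower[j]) / width), s - 1)
            hist[j][k] += 1
    return hist


# Two-dimensional toy bucket [5, 10) x [5, 10) with s = 5 subdivisions.
h = build_histograms([(5.2, 9.9), (5.7, 5.1), (9.5, 7.0)],
                     lower=(5.0, 5.0), upper=(10.0, 10.0), s=5)
```

Each point contributes exactly one count to each axis histogram, so building all histograms for a bucket with $b_i$ points costs time proportional to $b_i$ times the number of axes.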

Histograms tell much about the distribution of the points in a bucket but they introduce some ambiguity.
For example, the histograms in Figure 2 match both of the 2D distributions in Figure 3.

Figure 1: Points projected onto the $(v_x, x_0)$ plane.



Definition 15 (Axis Trend Function) The axis trend function $f_{i,j}(x_j)$ is some polynomial function for
bucket $B_i$ and axis j such that the following hold:

1. $f_{i,j} \ge 0$ over $B_i$.

2. $f'_{i,j}$, the derivative of $f_{i,j}$, does not change sign over the valid range.

The bucket trend function $f_i$ for bucket $B_i$ is the following:

$$f_i = \prod_{j} f_{i,j} \qquad (6)$$

Condition 1 ensures that the bucket trend function, built from the axis trend functions, does not contain
a negative probability region. Condition 2 requires that the bucket density increase, decrease, or remain
constant when considering any single axis. This condition avoids the ambiguity demonstrated in Figures 2
and 3 by giving a polynomial that approximates the density change correctly. We show this in the following
Lemma.



Lemma 16 Given a bucket $B_i$ with trend functions $f_{i,j}$, let $r_1$ and $r_2$ be identically sized regions in
bucket $B_i$. If the density in $B_i$ along each axis monotonically increases from $r_1$ to $r_2$, the following holds:

$$\int_{r_2} f_i \, d\phi > \int_{r_1} f_i \, d\phi \qquad (7)$$

Proof. Increasing densities from $r_1$ to $r_2$ translate into histograms that also increase from $r_1$ in the
direction of $r_2$ along each axis. The translation from histograms to the axis trend functions gives the
following conditions:

$$f_{i,j}(x_{2,j}) > f_{i,j}(x_{1,j}) \qquad (8)$$

where $x_{1,j}$ and $x_{2,j}$ are the j-th coordinates of the points in $r_1$ and $r_2$ respectively, and are located the same
distance from the j-th coordinates of the lower bounds of $r_1$ and $r_2$ respectively. Since this constraint holds
for each j and $f_{i,j} \ge 0$, we have:

$$f_i(x_2) > f_i(x_1) \qquad (9)$$

Hence by the properties of integration we conclude

$$\int_{r_2} f_i \, d\phi > \int_{r_1} f_i \, d\phi. \qquad (10)$$

Figure 2: Histogram of Points in 2 Dimensions. Panels: $h_{i,1}$, points projected onto $v_x$; $h_{i,2}$, points projected onto $x_0$.



Definition 15 allows a whole class of polynomial functions, and Lemma 16 applies to each member of that
class. However, in the following, we use a particular polynomial function derived from the product of linear
functions, which are obtained by using the least squares method for each histogram.

Definition 17 (Normalized Trend Functions) Let n be the number of points in the database, $b_i$ the
number of points in bucket $B_i$, and $f_i$ be given by Equation (6). The normalized trend function $F_i$ for bucket
$B_i$ is:

$$F_i = \frac{b_i f_i}{n \int_{B_i} f_i \, d\phi} \qquad (11)$$

and the percentage of points in bucket $B_i$ is:

$$p = \int_{B_i} F_i \, d\phi. \qquad (12)$$



With this definition we can calculate the number of points in O(1) time using the following simple lemma.

Lemma 18 Let $B_i$ be a bucket, n the number of points in the database, and p be given by Definition 17.
Then np is the number of points in bucket $B_i$, and np is calculated in O(1) time.

Figure 3: 2D Distribution Functions

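Proof. By Equations (11) and (12) we have:

$$np = n \int_{B_i} F_i \, d\phi = n \int_{B_i} \frac{b_i f_i}{n \int_{B_i} f_i \, d\phi} \, d\phi = n \, \frac{b_i}{n} \, \frac{\int_{B_i} f_i \, d\phi}{\int_{B_i} f_i \, d\phi} = b_i \qquad (13)$$

Clearly the above calculations take only O(1) time. ■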

Using the above definitions we can now define the bucket data structure used throughout the rest of this 
paper. 

Definition 19 (Skew Aware Buckets) A bucket is a hyper-rectangle with dimensions given by Definition 12
that maintains histograms given by Definition 13, additional data for the least squares method,
and the normalized trend function given by Definition 17. Throughout the rest of this paper we refer to these
as buckets.

2.2 Inserts and Deletes 

We can maintain the bucket (and hence the index) while deleting or inserting a point for any bucket Bi by 
recalculating the trend function Fi for the bucket. 

Lemma 20 Insertion and deletion of a moving point can be done in O(1) time.

Proof. When we insert or delete a point, we need to update the histograms and the normalized trend
function. Let the point to insert/delete be $P_a$, represented using the hex representation as $(a_0, a_1, a_2, a_3, a_4, a_5)$,
let $d_j$, for $0 \le j \le 5$, be the bucket width in the j-th dimension, and let s be the number of subdivisions in each histogram.
The concatenation of $id_0, \ldots, id_5$ gives the $ID_i$ of the bucket i to insert (or delete) $P_a$ into, where each $id_l$,
$0 \le l \le 5$, is defined by:

$$id_l = \left\lfloor \frac{a_l}{d_l} \right\rfloor \qquad (14)$$

The calculation of $ID_i$ and retrieving bucket $B_i$ takes O(1) time using a HashTable.

Let $hw_{i,j}$ be the histogram-division width for the j-th dimension, calculated as $hw_{i,j} = d_j / s$. Then $P_a$ is projected
onto each dimension to determine which division of the histogram to update. For the j-th dimension the k-th
division of histogram $h_{i,j}$ is given as follows:

$$k = \left\lfloor \frac{a_j - id_j \cdot d_j}{hw_{i,j}} \right\rfloor \qquad (15)$$

Let $h_{i,j,k}$ be the histogram division to update for each histogram. Update $h_{i,j,k}$ and the sums $\sum y_i$ and
$\sum x_i y_i$ from the normal equations in the least squares method. $N$, $\sum x_i$ and $\sum x_i^2$ from the normal
equations do not need updating since the number of histogram divisions s is fixed within the database.

We can now recalculate each $f_{i,j}$ in constant time by solving the 2x3 matrix corresponding to the normal
equations of the least squares method for each histogram. For each $f_{i,j}$ calculate the endpoints to determine
the required shift amount (Definition 15, property 1) and calculate $f_i$ from Equation (6). Now we calculate
$F_i$ using Equation (11). Each of these steps depends only on the dimension of the database. Hence for any
fixed dimension we can rebuild the normalized trend function $F_i$ in O(1) time. ■
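
The constant-time update argued above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation evaluated in this paper: the class and method names (`Bucket`, `BucketIndex`, `axis_trend`) are hypothetical, a Python `dict` stands in for the HashTable, bucket IDs are zero-based floor values, the least-squares fit is expressed over division indices 0..s-1 rather than bucket coordinates, and s is assumed to be at least 2. The shift step and the normalization to $F_i$ are omitted.

```python
import math


class Bucket:
    def __init__(self, dims: int, s: int):
        self.count = 0
        self.hist = [[0] * s for _ in range(dims)]
        self.sum_y = [0] * dims    # sum of histogram counts per axis
        self.sum_xy = [0] * dims   # sum of (division index * count) per axis


class BucketIndex:
    def __init__(self, widths, s):
        self.widths = list(widths)   # bucket width d_j per dimension
        self.s = s                   # histogram subdivisions per axis
        self.buckets = {}            # hash table: bucket ID tuple -> Bucket
        # Sums that depend only on s, so they never need updating.
        self.sum_x = s * (s - 1) // 2
        self.sum_xx = (s - 1) * s * (2 * s - 1) // 6

    def _update(self, point, delta):
        ids = tuple(math.floor(a / d) for a, d in zip(point, self.widths))
        bucket = self.buckets.get(ids)
        if bucket is None:
            if delta < 0:
                raise KeyError("point is not in the index")
            bucket = self.buckets[ids] = Bucket(len(point), self.s)
        bucket.count += delta
        for j, (a, d, i) in enumerate(zip(point, self.widths, ids)):
            k = min(int((a - i * d) / (d / self.s)), self.s - 1)
            bucket.hist[j][k] += delta
            bucket.sum_y[j] += delta
            bucket.sum_xy[j] += delta * k
        if bucket.count == 0:
            del self.buckets[ids]   # only non-empty buckets are kept

    def insert(self, point):
        self._update(point, +1)

    def delete(self, point):
        # The caller is assumed to delete only points inserted earlier.
        self._update(point, -1)

    def axis_trend(self, ids, j):
        """Least-squares (slope, intercept) over division indices 0..s-1."""
        b, s = self.buckets[ids], self.s
        slope = ((s * b.sum_xy[j] - self.sum_x * b.sum_y[j])
                 / (s * self.sum_xx - self.sum_x ** 2))
        return slope, (b.sum_y[j] - slope * self.sum_x) / s
```

Every step in `_update` touches one hash-table entry and one histogram division per axis, so its cost depends only on the number of dimensions, which is the property the lemma relies on.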



2.3 Index Data Structures 

There is no need to create a bucket unless it contains at least one point. We consider two classes of data 
structures for organizing the buckets: HashTables and Trees. 

For databases where inserts and deletes are the most common operation, the HashTable approach allows
these operations to run in constant time. However, the MaxCount operation will require an enumeration
of all the buckets and thus a running time of at least O(|B|), where |B| is the number of buckets. As long as
the number of buckets is reasonable, this approach works well.

For databases where MaxCount is the most common operation, we may use an R-tree structure
(Guttman, 1984; Beckmann et al., 1990) where the elements to be inserted are the buckets. This approach
speeds up the MaxCount query to O(log |B| + R), where R is the number of buckets needed to calculate
the query. The insert and delete costs for these R-trees are O(log |B|), because buckets do not overlap.

Since buckets do not change shape, the database is decomposable and allows each type of aggregation
to be calculated from simultaneous executions on subspaces of the index space. We discuss the method and
ramifications of this capability at the end of Section 3.4.



3 Dynamic MaxCount 

Section 3.1 reviews point domination in higher dimensions. Section 3.2 examines finding the percentage of
points in a bucket that are in the query space as a function of time. Section 3.3 puts the two previous
sections together to create the dynamic MaxCount algorithm for d dimensions.

3.1 Point Domination in 6-Dimensional Space 

Let B be the set of 6-dimensional hyper-buckets in the input, where each hyper-bucket $B_i$ has an associated
normalized trend function $F_i$ as in Definition 17. Let the vertices of $B_i$ be denoted $v_{i,j}$ where $1 \le j \le 64$,
because there are $2^6$ corner vertices to a 6-dimensional hyper-cube.

Definition 21 (Point Domination) Given two linearly moving points in three dimensions

$$P(t) = \begin{cases} p_x = x_1 t + x_2 \\ p_y = x_3 t + x_4 \\ p_z = x_5 t + x_6 \end{cases} \quad \text{and} \quad Q(t) = \begin{cases} q_x = v_x t + x_0 \\ q_y = v_y t + y_0 \\ q_z = v_z t + z_0 \end{cases} \qquad (16)$$

Q(t) dominates P(t) if and only if the following holds:

$$(p_x < q_x) \wedge (p_y < q_y) \wedge (p_z < q_z). \qquad (17)$$

The previous definition takes 6-dimensional points defined in Definition 11 and places them into three
inequalities of the form $x_2 < -t(x_1 - v_x) + x_0$. Each inequality defines a region below a line with slope $-t$.

Definition 22 (x-view, y-view and z-view projections) Projecting the inequalities from Definition 21
onto their respective dual planes allows a visualization in three 2-dimensional planes. Define these three
projections as the x-view, y-view and z-view respectively. Because the time t defines the slope $-t$ of each
line, all views contain lines with identical slopes. (See Figure 4.)

Definition 23 (Query Space) Given two moving query points $Q_1(t)$ and $Q_2(t)$ and lines $l_{x1}$, $l_{x2}$, $l_{y1}$, $l_{y2}$,
$l_{z1}$, $l_{z2}$ crossing them in their respective hexes with slopes $-t$, the intersection of the bands formed by the
area between $l_{x1}$ and $l_{x2}$, $l_{y1}$ and $l_{y2}$, and $l_{z1}$ and $l_{z2}$ in the 6-dimensional space forms a hyper-tunnel that
defines the query space as shown in Figure 4.



Figure 4: Views. Panels: X-view, Y-view and Z-view, each with axes Velocity and Position; the X-view shows the lines $l_{x1}$ and $l_{x2}$.

We can now visualize the query in space and time as the query space sweeping through a bucket as the 
slopes of the lines change with time. Using the above, it is now easy to prove the following lemma. 

Lemma 24 At any time t, the moving points whose hex representation lies below (or above) $l_{x1}$, $l_{y1}$ and $l_{z1}$
in their respective views are exactly those points that lie below (or above) $Q_1$ in the original 3-dimensional
space.

Proof. Let $Q_x(t) = v_x t + x_0$ where $v_x$ and $x_0$ are constants, and consider any x component of a point
$P_x(t) = x_1 t + x_2$ that lies below Q on the x-axis. Then

$$x_1 t + x_2 < v_x t + x_0 \qquad (18)$$

$$x_2 < -t(x_1 - v_x) + x_0 \qquad (19)$$

Obviously, at any time t these are the points below the line $x_2 = -t(x_1 - v_x) + x_0$, which has a slope of
$-t$ and goes through $(v_x, x_0)$. This representation is the dual of point $Q_x$. By Definition 23 this is exactly
the line $l_{x1}$. We can prove similarly that the points with duals above $l_{x1}$ are above $Q_1$ at any time t. The
proof that points whose hex representations are above or below $l_{y1}$ and $l_{z1}$ are exactly those points that lie
above or below $Q_1$ is similar to the proof for points above or below $l_{x1}$. By Definition 21 we conclude that
the points dominated by $Q_1$ in the dual space are those points that are below $l_{x1}$, $l_{y1}$, and $l_{z1}$ in the x-view,
y-view, and z-view, respectively. Similarly, we conclude that the points that dominate $Q_1$ in the dual space
are those points that are above $l_{x1}$, $l_{y1}$, and $l_{z1}$ in the x-view, y-view, and z-view, respectively. ■
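
The equivalence used in the proof of Lemma 24 is easy to check numerically for one view. The Python sketch below compares the test in the dual plane (Equation (19)) with the test in the original space (Equation (18)) on random integer inputs, where the arithmetic is exact. The function names are hypothetical and serve only this illustration.

```python
import random


def below_line_in_dual(x1, x2, vq, xq, t):
    """Is the dual point (x1, x2) below the line through (vq, xq) with slope -t?"""
    return x2 < -t * (x1 - vq) + xq


def below_query_in_space(x1, x2, vq, xq, t):
    """Is the moving coordinate x1*t + x2 below the query coordinate vq*t + xq?"""
    return x1 * t + x2 < vq * t + xq


def agree_on_random_inputs(trials=10_000, seed=0):
    """Return True if both tests agree on every sampled integer input."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2, vq, xq, t = (rng.randint(-50, 50) for _ in range(5))
        if (below_line_in_dual(x1, x2, vq, xq, t)
                != below_query_in_space(x1, x2, vq, xq, t)):
            return False
    return True
```

Integers are used on purpose: with floating-point inputs the two algebraically identical expressions can round differently when a point lies almost exactly on the line.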

Throughout the examples in this chapter, we use the points shown in Figures [5] and [6] to demonstrate 
the evaluation of a MaxCount query. We begin by creating the index. 



ID    Dimension 1        Dimension 2        Dimension 3
      X0       X1        X2       X3        X4       X5
 1    5.345    7.543     5.345    8.158     5.345    5.488
 2    6.354    9.023     6.354    5.488     6.354    5.159
 3    7.159    8.885     7.159    6.685     7.159    7.346
 4    7.645    9.117     7.645    5.159     7.645    8.885
 5    8.153    7.346     8.153    6.335     8.153    7.543
 6    8.156    6.335     8.156    7.346     8.156    9.023
 7    9.125    5.159     9.125    9.117     9.125    9.117
 8    9.118    6.685     9.118    8.885     9.118    6.335
 9    9.688    5.488     9.688    9.023     9.688    8.158
10    9.874    8.158     9.874    7.543     9.874    6.685

Figure 5: Example points.

Example 25 (Creating the Index) Consider a relation that contains the 6-dimensional space 10 units
(0 ... 10) in each dimension. If we break this up into buckets that are 5 units long in each dimension, we
have $2^6$ buckets. Although these divisions make a space with 64 buckets, all the points are contained in
a single bucket whose index is (2,2,2,2,2,2). All the points listed in Figure 5 have the same velocities
for each dual plane. Notice that the columns for $x_1$, $x_3$, and $x_5$ all have the same values in different orders.
The projection of the points onto the 3 dual planes shown in Figure 6 does not immediately show this
organization. Projecting the points for any view in Figure 6 onto each axis and creating histograms with 5
divisions gives the histograms for the Velocity and Position axes shown in Figure 7. Hence, each velocity
dimension has the same histogram. Similarly each position dimension has the same histogram. To create
these histograms each point is projected onto the axis. For example point 1 projected onto the $x_0$ axis is
given as:

$$(5.345, 7.543, 5.345, 8.158, 5.345, 5.488) \rightarrow 5.345. \qquad (20)$$

Figure 6: Points projected onto (a) X-view, (b) Y-view, and (c) Z-view. Panels: Dimension 1, Dimension 2 and Dimension 3, with horizontal axes V1 ($x_0$), V2 ($x_2$) and V3 ($x_4$) ranging from 5 to 10.

Figure 7: Position and velocity histograms, identical for each view. Panels: (a) Velocity, (b) Position.

Calculate the widths of the histograms as:

$$\text{Histogram-Width} = (10 - 5)/5 = 1 \qquad (21)$$

We determine the histogram division for each point by looping through the points and calculating the following:

$$\text{division} = \lfloor (\text{point} - \text{lowerbound}) / \text{Histogram-Width} \rfloor \qquad (22)$$

For example, the lowest and highest points in velocity in Figure 5 would be added to the divisions calculated as
$\lfloor (5.345 - 5)/1 \rfloor = 0$ and $\lfloor (9.874 - 5)/1 \rfloor = 4$.

The histograms translate into a set of points for each view given as: 

Velocity = {(0, 1), (1, 1), (2, 2), (3, 2), (4, 4)} (23) 
Position = {(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)} (24) 

Before applying the least squares method each division number must be translated back into the bucket.
Translation is done using the following code fragment:

    for i ← 0 to number_of_divisions - 1
        point[i][0] ← i * histogram_width + lowerbound
        point[i][1] ← histogram_value[i]
    end for



Translating the points from (23) and (24) gives the histograms for velocity and position in each view:

Velocity = {(5, 1), (6, 1), (7, 2), (8, 2), (9, 4)} (25) 
Position = {(5, 2), (6, 2), (7, 2), (8, 2), (9, 2)}. (26) 

Using the least squares method to fit each of these to a line yields the following for each velocity and position
dimension:

Velocity : y = 0.7x - 2.9 (27)

Position : y = 0x + 2. (28)

Evaluating Equations (27) and (28) at the end points to find the shift value for the axis trend function to
add to each equation gives:

Velocity : y(5) = 0.6, y(10) = 4.1 (29)

Position : y(5) = y(10) = 2. (30)

In this case no constant needs to be added to our equation and the trend function becomes:

$$f_i = (0.7x_0 - 2.9)(0x_1 + 2)(0.7x_2 - 2.9)(0x_3 + 2)(0.7x_4 - 2.9)(0x_5 + 2) \qquad (31)$$

Calculating $F_i$ from Equation (11) requires integrating $f_i$ over the bucket, where $\int_{B_i} = \int_5^{10} \cdots \int_5^{10}$ and
$d\phi = dx_0\,dx_1\,dx_2\,dx_3\,dx_4\,dx_5$, which gives

$$\int_{B_i} f_i \, d\phi = 8 \int_{B_i} (0.7x_0 - 2.9)(0.7x_2 - 2.9)(0.7x_4 - 2.9) \, d\phi = 1622234.375. \qquad (32)$$

Since all the points reside in a single bucket, $b_i = n$, and the constant c is given by $c = 1/1622234.375 \approx
6.164 \times 10^{-7}$. Then $F_i$ is given by

$$F_i = c\,(0.7x_0 - 2.9)(0x_1 + 2)(0.7x_2 - 2.9)(0x_3 + 2)(0.7x_4 - 2.9)(0x_5 + 2) = 8c\,(0.7x_0 - 2.9)(0.7x_2 - 2.9)(0.7x_4 - 2.9) \qquad (33)$$

So far we have calculated the normalized trend function Fi for just one bucket. This calculation finishes 
the bucket creation process, and the index contains this single bucket defined by the points lowerbound = 
(5, 5, 5, 5, 5, 5) and upperbound = (10, 10, 10, 10, 10, 10). 
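
The numbers in Example 25 can be reproduced with a short Python sketch. It uses only the X0 and X1 columns of Figure 5, since the other two views carry the same velocity values and a permutation of the same position values. The helper names (`histogram`, `fit_line`, `integral`) are hypothetical, and the fit uses the left edge of each histogram division, as in the translation step above.

```python
# Columns X0 (velocity) and X1 (position) of the x-view from Figure 5.
velocity = [5.345, 6.354, 7.159, 7.645, 8.153, 8.156, 9.125, 9.118, 9.688, 9.874]
position = [7.543, 9.023, 8.885, 9.117, 7.346, 6.335, 5.159, 6.685, 5.488, 8.158]
LOWER, UPPER, S = 5.0, 10.0, 5
WIDTH = (UPPER - LOWER) / S   # Equation (21)


def histogram(values):
    """Count the values falling into each of the S divisions (Equation (22))."""
    counts = [0] * S
    for v in values:
        counts[int((v - LOWER) // WIDTH)] += 1
    return counts


def fit_line(counts):
    """Least-squares line through (left edge of division, count)."""
    xs = [i * WIDTH + LOWER for i in range(S)]
    n = len(xs)
    sx, sy = sum(xs), sum(counts)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, counts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


def integral(slope, intercept):
    """Integral of slope*x + intercept over [LOWER, UPPER]."""
    return slope * (UPPER ** 2 - LOWER ** 2) / 2 + intercept * (UPPER - LOWER)


v_line = fit_line(histogram(velocity))   # close to (0.7, -2.9), Equation (27)
p_line = fit_line(histogram(position))   # (0.0, 2.0), Equation (28)
# The six-dimensional integral factors into three identical (velocity, position) views.
total = (integral(*v_line) * integral(*p_line)) ** 3   # Equation (32)
c = 1 / total
```

The factorization in the last step works because the trend function is a product of single-axis functions, so the integral over the bucket is the product of six one-dimensional integrals.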

3.2 Approximating the Number of Points in a Bucket 

As a line through a query point sweeps across a bucket, the points in the bucket that dominate the query
point are approximated by the integral over the region above the line. In each of the three views the query
space intersects the plane giving the cases shown in Figure 8.

Figure 8: Sweep algorithm cases. Panels: (a) Upper Left, (b) Upper Right, (c) Except Lower Left, (d) Except Lower Right, (e) Both Left, (f) Both Right, (g) Both Upper, (h) All; the panel labels also include "Increasing Slope" and "Decreasing Slope".



Definition 26 (Percentage Function) Integrating over the region above the line gives an approximation
of the percentage of points in the query space. We define the percentage function as:

$$p = \int_{r_i} F_i \, d\phi \qquad (34)$$

where $r_i$ is the region of the bucket in the query space. If two lines go through the same bucket, we have the
smaller region $r_2$ subtracted from the larger region $r_1$ as follows:

$$\Delta p = \int_{r_1} F_i \, d\phi - \int_{r_2} F_i \, d\phi. \qquad (35)$$

Here, regions $r_1$ and $r_2$ correspond to the regions above $Q_1$ and $Q_2$, respectively, in the corresponding figure. Lemma 15 showed
that finding the number of points in the bucket requires multiplying Equation (34) or (35) by $n$.

For each case shown in Figure 8 we describe the function that results from integration in one view. To
extend the result to any number of views, we take the result from the last view and integrate it in the next
view. If the region below the line were desired, $p_{lower} = 1 - p$ gives the percentage of points below the line.

For cases (a) - (h) below, let $Q = (x_{1,q}, x_{2,q}, \ldots, x_{6,q})$. For the x-view, let the lower left corner vertex
be $(x_{1,l}, x_{2,l})$ and the upper right corner vertex be $(x_{1,u}, x_{2,u})$. In addition, each line denoted $l$ is given by
$x_2 = -t(x_1 - x_{1,q}) + x_{2,q}$ and corresponds to a line shown in the corresponding case in Figure 8.

Case (a): For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{2,u}$. The integral over the shaded region is given
by the following:

$$p_a = \int_{x_{1,l}}^{\frac{x_{2,u} - x_{2,q}}{-t} + x_{1,q}} \int_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}} F_i \, dx_2 \, dx_1 \qquad (36)$$

Notice that the lower bound of the integral over $dx_2$ contains $x_1$. This dependence within each view does
not affect the integration in the remaining four dimensions. The solution to Equation (36) has the form:

$$a t^2 + b t + c + \frac{d}{t} + \frac{e}{t^2} \qquad (37)$$

Case (b): For this case $l$ crosses the bucket at $x_{1,u}$ and $x_{2,u}$. The integral over the shaded region is given
by:

$$p_b = \int_{\frac{x_{2,u} - x_{2,q}}{-t} + x_{1,q}}^{x_{1,u}} \int_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}} F_i \, dx_2 \, dx_1. \qquad (38)$$

The solution has the form of Equation (37).

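Case (c): For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{2,l}$. The integral over the shaded region above the
line is given by:

$$p_c = \int_{x_{1,l}}^{\frac{x_{2,l} - x_{2,q}}{-t} + x_{1,q}} \int_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}} F_i \, dx_2 \, dx_1 + \int_{\frac{x_{2,l} - x_{2,q}}{-t} + x_{1,q}}^{x_{1,u}} \int_{x_{2,l}}^{x_{2,u}} F_i \, dx_2 \, dx_1. \qquad (39)$$

The solution has the form of Equation (37).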

Case (d): For this case $l$ crosses the bucket at $x_{1,u}$ and $x_{2,l}$. The integral over the shaded region is given
by:

$$p_d = \int_{\frac{x_{2,l} - x_{2,q}}{-t} + x_{1,q}}^{x_{1,u}} \int_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}} F_i \, dx_2 \, dx_1 + \int_{x_{1,l}}^{\frac{x_{2,l} - x_{2,q}}{-t} + x_{1,q}} \int_{x_{2,l}}^{x_{2,u}} F_i \, dx_2 \, dx_1. \qquad (40)$$

The solution has the form of Equation (37).

Case (e): For this case $l$ crosses the bucket at $x_{2,l}$ and $x_{2,u}$. The integral over the shaded region is given
by:

$$p_e = \int_{x_{2,l}}^{x_{2,u}} \int_{x_{1,l}}^{\frac{x_2 - x_{2,q}}{-t} + x_{1,q}} F_i \, dx_1 \, dx_2. \qquad (41)$$

The solution has the form of

$$c + \frac{d}{t} + \frac{e}{t^2} \qquad (42)$$

which is like Equation (37) with $a = b = 0$.



Case (f): Similar to case (e), $l$ crosses the bucket at $x_{2,l}$ and $x_{2,u}$. The integral over the shaded region is
given by:

$$p_f = \int_{x_{2,l}}^{x_{2,u}} \int_{\frac{x_2 - x_{2,q}}{-t} + x_{1,q}}^{x_{1,u}} F_i \, dx_1 \, dx_2. \qquad (43)$$

The solution has the form of Equation (42).

Case (g): For this case $l$ crosses the bucket at $x_{1,l}$ and $x_{1,u}$. The integral over the shaded region is given
by:

$$p_g = \int_{x_{1,l}}^{x_{1,u}} \int_{-t(x_1 - x_{1,q}) + x_{2,q}}^{x_{2,u}} F_i \, dx_2 \, dx_1. \qquad (44)$$

The solution has the form

$$a t^2 + b t + c \qquad (45)$$

which is like Equation (37) with $d = e = 0$.

Case (h): The line $l$ crosses below all the corner vertices, hence the integral of the function is given as:

$$p_h = \int_{x_{1,l}}^{x_{1,u}} \int_{x_{2,l}}^{x_{2,u}} F_i \, dx_2 \, dx_1. \qquad (46)$$

The solution has the form of Equation (45).

The above cases have solutions for each view in the form of Equation (37). Hence the percentage function
for a single bucket as a function of $t$ is of the form:

$$p = \left(a_x t^2 + b_x t + c_x + \frac{d_x}{t} + \frac{e_x}{t^2}\right)\left(a_y t^2 + b_y t + c_y + \frac{d_y}{t} + \frac{e_y}{t^2}\right)\left(a_z t^2 + b_z t + c_z + \frac{d_z}{t} + \frac{e_z}{t^2}\right) \qquad (47)$$

where $t \neq 0$ when $d_x, d_y, d_z, e_x, e_y, e_z \neq 0$. Finally, renaming variables gives the general form:

$$p = a_6 t^6 + a_5 t^5 + a_4 t^4 + a_3 t^3 + a_2 t^2 + a_1 t + c + \frac{d_1}{t} + \frac{d_2}{t^2} + \frac{d_3}{t^3} + \frac{d_4}{t^4} + \frac{d_5}{t^5} + \frac{d_6}{t^6} \qquad (48)$$

where $t \neq 0$ when $d_i \neq 0$ for $1 \le i \le 6$. Since Equation (48) is closed under subtraction, $\Delta p$ from
Equation (35) will also have the same form.
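The step from the per-view form to the general form is a product of three expressions with powers of $t$ from $-2$ to $2$. A minimal sketch, assuming each per-view function is stored as a dictionary from exponent to coefficient (a representation chosen here for illustration, not taken from the original implementation), shows that the product has exponents from $-6$ to $6$ and that subtraction stays within that range.

```python
from collections import defaultdict


def laurent_mul(p, q):
    """Multiply two functions of t stored as {exponent: coefficient}."""
    out = defaultdict(float)
    for ep, cp in p.items():
        for eq, cq in q.items():
            out[ep + eq] += cp * cq
    return dict(out)


def laurent_sub(p, q):
    """Subtract q from p; the exponent range never grows."""
    out = defaultdict(float, p)
    for e, cq in q.items():
        out[e] -= cq
    return dict(out)


# One view with all five coefficients set to 1 (placeholder values).
view = {2: 1.0, 1: 1.0, 0: 1.0, -1: 1.0, -2: 1.0}
product = laurent_mul(laurent_mul(view, view), view)
difference = laurent_sub(product, laurent_mul(laurent_mul(view, view), {0: 2.0}))
```

With real coefficients the same two helpers would build $\Delta p$ for a bucket from the per-view results of the two query points.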

As the query space from Definition [22] sweeps through a bucket, it crosses the bucket corner vertices. 
Each time a corner vertex crosses the query space boundary, the case that applies may change in one or more 
of the views. 

Definition 27 (Bucket and Index Time-Intervals) The span of time in which no vertex from bucket
$B_i$ enters or leaves the query space defines a bucket time-interval. We denote the time-interval as a half-open
interval $[l, u)$ where $l$ is the lower bound and $u$ is the upper bound. Each bucket time-interval has an
associated percentage function $\Delta p$ given by Equation (35). We define the index time-interval similarly except
that the span of time is defined when no vertex from any bucket in the index enters or leaves the query space.

As we will see, index time-intervals are created from individual bucket intervals. Throughout the rest of 
this dissertation we use the term time intervals when the context clearly identifies which type we mean. 

Definition 28 (Time-Partition Order) Let $B$ be the set of buckets. Let $Q_1$ and $Q_2$ be two query points
and $(t^{[}, t^{]})$ be the query time interval. We define the Time-Partition Order to be the set of ordered time
instances $TP = t_1, t_2, \ldots, t_k$ such that $t_1 = t^{[}$ and $t_k = t^{]}$ and each $[t_i, t_{i+1})$ is an index time-interval.

Example 29 (Calculating Bucket Time-Intervals) Continuing Example 25, let $Q$ be a query defined
by:

$q_1$ = (9.5, 8, 9.5, 8, 9.5, 8) (49)
$q_2$ = (8.5, 5, 8.5, 5, 8.5, 5) (50)
$T$ = (0.1, 10) (51)

where $q_1$ and $q_2$ form the query space over the query time interval $T$. To determine time intervals when corner
vertices do not change, find the slopes of lines through both query points and each corner vertex of the bucket.
Figure 9 shows lines from the two query points to the corner vertices for the first dimension. Since the query
points are the same in each dimension, each will appear the same. The set of times when lines through $q_1$
(shown as solid lines) cross corner vertices is {0.4, 6}. The set of times when lines through $q_2$ (shown as dotted
lines) cross corner vertices and are in the time interval is {1.42857}. The union of these two sets along with the
end points makes up the times used to create the time intervals: {(0.1, 0.4), (0.4, 1.42857), (1.42857, 6), (6, 10)}.
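The cross times in this example follow from solving the line equation $x_2 = -t(x_1 - x_{1,q}) + x_{2,q}$ for $t$ at each corner vertex. The sketch below does this for one view; the function names are hypothetical, and it assumes the bucket corners of the single bucket built earlier.

```python
def cross_times(query, corners, t_start, t_end):
    """Times in (t_start, t_end) at which the line through `query` hits a corner."""
    qx, qy = query
    times = set()
    for cx, cy in corners:
        if cx == qx:
            continue  # corner directly above/below the query point: no finite t
        t = (cy - qy) / (qx - cx)
        if t_start < t < t_end:
            times.add(t)
    return sorted(times)


def time_intervals(times, t_start, t_end):
    """Consecutive intervals delimited by the end points and the cross times."""
    cuts = [t_start] + sorted(times) + [t_end]
    return list(zip(cuts[:-1], cuts[1:]))


corners = [(5, 5), (5, 10), (10, 5), (10, 10)]
t1 = cross_times((9.5, 8), corners, 0.1, 10)   # lines through q1
t2 = cross_times((8.5, 5), corners, 0.1, 10)   # lines through q2
intervals = time_intervals(set(t1) | set(t2), 0.1, 10)
```

For $q_1$ this yields $4/9 \approx 0.44$ (listed as 0.4 in the example) and 6; for $q_2$ it yields $10/7 \approx 1.42857$.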

Figure 9: Lines from query points to corner vertices.

Integration over the spatial dimensions of the eight possible cases presented in Figure 8 gave a function
of the form of Equation (48). Maximizing Equation (48) in the temporal dimension by first taking the
derivative, we get:

$$\Delta p' = (6a_6 t^{12} + 5a_5 t^{11} + 4a_4 t^{10} + 3a_3 t^9 + 2a_2 t^8 + a_1 t^7 - d_1 t^5 - 2d_2 t^4 - 3d_3 t^3 - 4d_4 t^2 - 5d_5 t - 6d_6)/t^7 \qquad (52)$$

where $t \neq 0$. Solving $\Delta p' = 0$ requires finding the roots of this 12-degree polynomial, which is not possible
using an exact method. Hence we need a numerical method for solving the polynomial.
The following factors influenced the choice of the numerical method: 

1. Speed of the algorithm is more important than accuracy because we don't expect the original function 
to change dramatically over an index time-interval. We expect small change because in practice the 
time intervals are short. 

2. The algorithm must converge toward a solution within the interval, that is the algorithm must be 
stable. 



3. Given that we are maximizing Equation (48) over a short time interval, we don't expect Equation (52)
to have more than one solution. This assumption may seem naive, but it is reasonable given factor (1).

Factor (1) above is related to (3) in that it indicates that points close together have similar values, but 
emphasizes that speed is the goal. Factor (2) above eliminates several algorithms from consideration, but 
must be required to keep from choosing a solution that is not within the time interval evaluated. 

Of the three points to consider, (3) is probably the least intuitive. Consider the following conjecture: 

Conjecture 30 Given p for a set of buckets, if the Euclidean distance between two maxima is small, then 
the difference between the maxima is small. 

Consider the physical characteristics of the system. The value of $p$ over the time interval changes no
more than $b_i$ for any bucket $B_i$. Clearly $p$ either increases as it encompasses more of the bucket or decreases
as it encompasses less of the bucket. When $p$ represents the distribution over several buckets, each bucket
contributes a decreasing or increasing amount over the time interval. Clearly $p$ is bounded below by 0 and
above by $\sum_i b_i$. Hence, the rate at which the derivative $p'$ changes is characterized by the physical system
and reflects the differences in the buckets as $t$ changes. Since $p$ does not change dramatically over $t$ for any
bucket, the change in several buckets over $t$ will likewise not be dramatic. Hence if the distance between
two maxima is small, the maxima have a small difference in magnitude. This rationale for the conjecture
above is supported by the experiments.

Based on these factors, we use a common method for the first approximation: we look at the graph of $p'$.
Programmatically, we check $c$ intervals of Equation (52) for a change in sign. If there exists a sign change, we use
the bisection method to find the root. If two points lie within $\epsilon$ of 0, we perform a check for each of these
intervals when no change of sign is found. If some roots exist, we check them for maximal values along with
the end points.
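The procedure just described can be sketched as follows. This is an illustrative version only: the number of scanned pieces and the function names are assumptions, the $\epsilon$ check for near-zero values without a sign change is omitted, and the iteration cap of 10 matches the bound discussed below.

```python
def find_roots(f, lo, hi, pieces=8, max_iter=10):
    """Scan [lo, hi] in equal pieces; bisect each piece whose end values differ in sign."""
    roots = []
    step = (hi - lo) / pieces
    for k in range(pieces):
        a, b = lo + k * step, lo + (k + 1) * step
        fa, fb = f(a), f(b)
        if fa == 0.0:
            roots.append(a)                  # root lies exactly on a scan point
        elif fa * fb < 0:
            for _ in range(max_iter):        # constant bound on iterations
                mid = (a + b) / 2
                fm = f(mid)
                if fa * fm <= 0:
                    b = mid
                else:
                    a, fa = mid, fm
            roots.append((a + b) / 2)
    return roots


def approximate_maximum(p, dp, lo, hi):
    """Compare p at the end points and at the approximate roots of dp."""
    candidates = [lo, hi] + find_roots(dp, lo, hi)
    best_t = max(candidates, key=p)
    return best_t, p(best_t)


# Toy function with a single interior maximum at t = 1.3.
best_t, best_p = approximate_maximum(
    lambda t: 3 - (t - 1.3) ** 2, lambda t: -2 * (t - 1.3), 0.0, 4.0
)
```

Because both the number of pieces and the number of bisection steps are fixed, the work per time interval is bounded by a constant.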

Lemma 31 The approximate maximum within a time interval can be found in O(1) time.

Proof. Each time interval has an associated probability function $\Delta p$ which is calculated in O(1) time.
Finding $\Delta p' = 0$ also takes O(1) time. By placing a constant bound on the number of iterations in the
bisection method, we bound the time required in the numerical section of the algorithm by a constant.
Plugging in the solution found by the bisection method along with the end points also takes O(1) time.
Hence, the running time to find the maximum within a bucket is O(1). ■

We chose to limit the number of iterations in the bisection method to 10, which limits the running time
to a small constant value. This value was chosen based on empirical observation that index time-intervals
remain small (about 0.01 to 4). Hence, using the bisection method allows us to narrow our search down to
an interval at least as small as $4/2^{10} \approx 0.0039$ units of time. If time is measured in hours, this interval equates to only
14 seconds.

Example 32 (Building Time-Intervals and Finding MaxCount) Continuing Example 29, we build
the functions for time intervals

{(0.1, 0.4), (0.4, 1.42857), (1.42857, 6), (6, 10)} (53)

by integrating using the different cases from Figure 8. For space concerns we omit the integrals here and note
that the result of integrating each interval and finding the maximum gives a maximum of approximately 3
at $t = 0.4$.

Time Interval: [0.1, 0.4]. Here case (c) holds for query point $q_2$ over this time interval. Hence the integral
for query point $q_2$ and $t \in [0.1, 0.4]$ in each dimension is given as:

$$p_c = \int_{8.5}^{10}\int_{5}^{10} 2(0.7x_0 - 2.9)\,dx_1\,dx_0 + \int_{5}^{8.5}\int_{-t(x_0 - 8.5) + 5}^{10} 2(0.7x_0 - 2.9)\,dx_1\,dx_0 = 117.5 - 17.3546t \qquad (54)$$

Case (g) holds for query point $q_1$ and thus the integral for query point $q_1$ and $t \in [0.1, 0.4]$ in each dimension
is given as:

$$p_g = \int_{5}^{10}\int_{-t(x_0 - 9.5) + 8}^{10} 2(0.7x_0 - 2.9)\,dx_1\,dx_0 = 47.0 - 32.416t \qquad (55)$$

Hence, taking the product over the three identical views, the integral of the region is:

$$p = c(p_c - p_g)^3 = 2.106 \times 10^{-3} t^3 + 2.957 \times 10^{-2} t^2 + 0.138t + 0.216 \qquad (56)$$

Evaluating $p$ at the start and end of the time interval we have $p(0.1) \approx 0.23$ and $p(0.4) \approx 0.28$. Figure 10
shows $p$ in the time interval. Clearly $p$ is increasing and consequently we have a maximum at the end point
$t = 0.4$. Since there are 10 points we must multiply $p(0.4)$ by 10 to get the approximation for the time
interval as:

$$MaxCount_{0.1 \le t \le 0.4} \approx 2.8. \qquad (57)$$

Figure 10: Graph of p, 0.1 < t < 0.4.

Since we cannot have partial points, we can round this result to 3.

The rest of the intervals are similar using different cases. We omit the remaining cases to save space and
to eliminate the risk of boring the reader. None of the other intervals has a higher MaxCount, and so it
follows that MaxCount has an approximate value of 3 at time $t = 0.4$.
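The numbers in this example can be re-derived from (54) and (55). The sketch below assumes, as the cubic in (56) suggests, that the per-view difference $p_c - p_g$ is cubed because the three views are identical, and that the result is scaled by the constant $c$ from the bucket creation example; the variable names are illustrative.

```python
C = 1 / 1622234.375                       # normalizing constant c of the single bucket
A = 117.5 - 47.0                          # constant term of p_c - p_g
B = -17.3546 + 32.416                     # coefficient of t in p_c - p_g


def p(t):
    """Percentage function on [0.1, 0.4]: product of three identical views."""
    return C * (A + B * t) ** 3


# Expanded coefficients of t^3, t^2, t^1, t^0, for comparison with (56).
coefficients = [C * B ** 3, 3 * C * A * B ** 2, 3 * C * A ** 2 * B, C * A ** 3]

N_POINTS = 10
estimate = N_POINTS * p(0.4)              # approximate MaxCount on this interval
```

Since $B > 0$, $p$ is increasing on the interval, which agrees with the maximum being taken at the end point $t = 0.4$.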

3.3 Dynamic MaxCount Algorithm 



MaxCount(H, Q_1, Q_2, t^[, t^])

input: A set of buckets H built by the index structure presented,
       query points Q_1(t) and Q_2(t) and a query time interval (t^[, t^])
output: The estimated MaxCount value.

01. TimeIntervals ← ∅                                              O(1)
02. for i ← 0 to |H| - 1                                           O(B)
03.    CrossTimes ← CalculateCrossTimes(Q_1, Q_2, t^[, t^], H_i)   O(1)
04.    for j ← 1 to |CrossTimes| - 1                               O(1)
05.       Union(TimeIntervals, TimeInterval(t_{j-1}, t_j))         O(1)
06.    end for
07. end for
08. TimeIntervals = BucketSort(TimeIntervals)                      O(B)
09. IndexTimeIntervals = Merge(TimeIntervals)                      O(B)
10. for each IndexTimeInterval ∈ IndexTimeIntervals                O(B)
11.    Calculate(MaxCount, MaxTime, IndexTimeInterval)             O(1)
12. end for
13. return (MaxCount, MaxTime)

Figure 11: Areas of successive slopes.



The algorithm to compute MaxCount with each line labeled with its running time is given above. Line
01 initializes a set of bucket time-interval objects to be empty. Line 03 returns a list of ordered times when
a line through $Q_1$ or $Q_2$ crosses a bucket corner vertex. Line 05 turns this list into a set of TimeInterval
objects and adds them to the set of TimeIntervals. We list this "for each" loop as O(1) because it consists
of a constant number of calculations bounded by the number of vertices in the bucket. Line 08 uses the linear
time sorting algorithm BucketSort to sort the bucket time intervals. Line 09 creates the time-partition
order and index bucket time intervals from the bucket time intervals in O(B). An additional pass adds the
bucket time intervals to the appropriate index time-intervals in O(B). Lines 10-12 perform the MaxCount
calculation discussed above.

In order to use the linear time BucketSort algorithm, we need the following definition and lemmas. 

Definition 33 (Time-Interval Ordering) We define the lexicographical ordering $\prec$ of two time intervals
A and B as follows:

$$A.l < B.l \Rightarrow A \prec B \qquad (58)$$
$$A.l = B.l \land A.u < B.u \Rightarrow A \prec B \qquad (59)$$
$$A.l = B.l \land A.u = B.u \Rightarrow A = B \qquad (60)$$

The distribution of time interval objects created in Line 08 of the MaxCount algorithm may not be
uniform across the query time interval $T = [t^{[}, t^{]}]$. However, we can still prove the following.

Lemma 34 If the distribution of buckets is uniform, then the distribution of bucket time-interval objects
can be uniformly distributed within the sorting buckets of the bucket sort.

Proof. Consider the relationship between successive slopes measured as the angles between lines through a
query point $Q$ with slopes $s_i = -t_i$ and $s_{i+1} = -t_{i+1}$. Suppose $\Delta t = 1$ with $t_0 = 0$ and $t_1 = 1$; then the
angle between the two lines is $\Delta s = \frac{\pi}{4}$. The solid lines in Figure 11 show that half of the bucket corner
vertices are swept by the line sweeping through $Q$ between $s_0 = 0$ and $s_1 = -1$. Consider a query time
interval [0, 10]. Half of the corner vertices, and thus half of the time intervals, are between time $t = 0$ and
$t = 1$. Thus, we conclude that the time interval objects created by sweeping will not be uniformly distributed
throughout the query time interval.

Let $Q'$ be the midpoint between $Q_1$ and $Q_2$. Let $S = \{t_1, \ldots, t_k\}$ where $t_1 = t^{[}$, $t_k = t^{]}$ and $t_{i+1} = t_i + L$ for
some positive constant $L$ and $1 \le i \le k - 1$. Let $B_b$ be a bucket that contains the space in the 6-dimensional
index. Model the normalized bucket function for $B_b$ as a constant $F = 1$. Thus $p$, the bucket probability,
from Equation (47) becomes the hyper-volume of the space swept by the line through $Q'$. By Lemma 31
we can find the area for a specific time interval in $S$ in constant time. The percentage of sorting buckets,
$posb_i$, needed in any time interval $T_i = [t_i, t_{i+1}) \in S$ within the query time interval is given by:

$$posb_i = \frac{p(T_i)}{\sum_{j=1}^{k-1} p(T_j)} \qquad (61)$$

Let $N$ be the number of sorting buckets. Then, the number of sorting buckets, $nosb_i$, assigned to interval $i$
is given by:

$$nosb_i = N \cdot posb_i \qquad (62)$$

If $nosb_i < 1$ we can combine it with $nosb_{i+1}$. If the query time interval is very large, then we may need to
include multiple time intervals from S to get one sorting bucket. Thus, we create more sorting buckets (with 
smaller time intervals) in areas where the expected number of bucket time intervals is large. Conversely, we 
create fewer sorting buckets (with larger time intervals) in areas where the expected number of bucket time 
intervals is small. Hence we model each sorting bucket so that its time interval length directly relates to 
the percentage of bucket time intervals that are assigned to it. Thus, we conclude that we will uniformly 
distribute the time interval objects across all sorting buckets. ■ 

Lemma 35 Insertion of any bucket time-interval object $T_0$ into the proper sorting bucket can be done in
O(1) time.

Proof. The distribution of sorting buckets is determined by $k$ time intervals in Lemma 34. Call these
sorting time interval objects, where each object contains: the lower bound $l$, the upper bound $u$, the number
of sorting buckets assigned to this interval $b_s$, the length of the time interval for the sorting bucket $w$, and an
array $B_p$ containing pointers to these sorting buckets. Let $A$ be the array of sorting time interval objects,
and $L$ be the length of each time interval, where the time intervals are as in Lemma 34. Then, finding the
correct sorting bucket for $T_0$ requires two calculations:

$$SortingTimeInterval = A\left[\left\lfloor \frac{T_0.l}{L} \right\rfloor\right] \qquad (63)$$
$$SortingBucket = B_p\left[\left\lfloor \frac{T_0.l - SortingTimeInterval.l}{w} \right\rfloor\right] \qquad (64)$$

Each of these calculations requires constant time, hence $T_0$ can be inserted into the proper sorting bucket
in O(1) time. ■
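The two-step lookup in this proof can be sketched as below. This is a sketch under stated assumptions: the class and function names are our own, the query interval is taken to start at `t_start` (subtracted before dividing by $L$), each sorting time interval splits its length evenly among its $b_s$ sorting buckets so that $w = L / b_s$, and indices are clamped to guard against the upper end point.

```python
from dataclasses import dataclass, field


@dataclass
class SortingTimeInterval:
    l: float                      # lower bound
    u: float                      # upper bound
    w: float                      # length of each sorting bucket in this interval
    buckets: list = field(default_factory=list)   # plays the role of the array B_p


def make_sorting_intervals(t_start, L, bucket_counts):
    """Build the array A: interval i gets bucket_counts[i] equal-width sorting buckets."""
    A = []
    for i, b_s in enumerate(bucket_counts):
        l = t_start + i * L
        A.append(SortingTimeInterval(l, l + L, L / b_s, [[] for _ in range(b_s)]))
    return A


def insert(A, t_start, L, T0):
    """Place time interval T0 = (l, u) with two constant-time index calculations."""
    lower = T0[0]
    sti = A[min(int((lower - t_start) // L), len(A) - 1)]
    idx = min(int((lower - sti.l) // sti.w), len(sti.buckets) - 1)
    sti.buckets[idx].append(T0)
    return sti, idx


A = make_sorting_intervals(0.0, 1.0, [4, 2, 1])   # more buckets where intervals are dense
first = insert(A, 0.0, 1.0, (0.3, 0.9))
second = insert(A, 0.0, 1.0, (1.6, 2.0))
third = insert(A, 0.0, 1.0, (2.2, 3.0))
```

Neither step depends on the number of buckets or time intervals, which is the property the lemma relies on.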

Using the above two lemmas, we can prove the following. 

Theorem 36 The running time of the MaxCount algorithm is O(B) where B is the number of buckets. 

Proof. Let $H$ be the set of buckets where each bucket $B_i$ contains the normalized trend function $F_i$. Let
$Q_1$ and $Q_2$ be the query points and $(t^{[}, t^{]})$ be the query time interval. (Lines 01-07): Calculating the time
intervals takes O(B) time because the cross times for each bucket can be calculated in constant time. (Line
08): By Lemmas 34 and 35 we have an approximately even distribution of time interval objects within the
sorting buckets where we can insert an object in constant time. This result fulfills the requirements of the
BucketSort (2001), which allows the intervals to be sorted in O(B) time. (Lines 09-12):
Calculate the MaxCount and time for each time interval in constant time using Lemma 31. These lines
take O(B) time because there are O(B) time intervals. Finding the global MaxCount and time requires
retaining the maximum time and count at line 11. Returning the MaxCount and time takes O(1) time.
Thus, the running time is given by O(B) + O(B) + O(B) + O(1) = O(B). ■



3.4 An Exact MaxCount Algorithm 

The Exact MaxCount algorithm below finds the exact MaxCount values. It is easy to see that the running 
time is given by: 

$$O(N) + O(n \log n) \qquad (65)$$

where N is the number of points in the database and n represents the result size of the query. 

It is possible to slightly improve the algorithm below. First, divide the index space into $k$ subspaces and
maintain separate partial databases for each. Assign processes on individual systems to each database to
calculate the MaxCount query and return the time intervals to a central process. Merging the time interval
lists into a global time interval list saves time on the sorting part of the algorithm. The running time for
each of $k$ partial databases would be close to $O(\frac{n}{k} \log \frac{n}{k})$. This result is an approximate value because we
do not guarantee an even split between partial databases. Placing buckets for each partial database in a
tree structure may be reasonable and could cut down the average running time to $O(\log N + n \log n / k)$.
Implementation and analysis for this particular approach is left as future work.
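The core of the exact algorithm is a sweep over sorted entry and exit times. The following minimal sketch assumes the entry and exit time of each qualifying point has already been computed (the selection and the entry/exit calculations are not shown), and it treats a point as leaving the query at its exit time; the function name is illustrative.

```python
from collections import Counter


def exact_max_count(intervals):
    """intervals: (entry_time, exit_time) for each point that intersects the query.

    Returns (time, count) of the first moment the maximum count is reached.
    """
    delta = Counter()
    for entry, exit_ in intervals:
        delta[entry] += 1          # one more point inside from this time on
        delta[exit_] -= 1          # one fewer point inside from this time on
    best_time, best_count, running = None, 0, 0
    for t in sorted(delta):        # the O(n log n) sorting step
        running += delta[t]
        if running > best_count:
            best_time, best_count = t, running
    return best_time, best_count


result = exact_max_count([(0, 5), (1, 3), (2, 4), (3, 6)])
```

In this toy input three points overlap from time 2 onward, so the sweep reports a maximum of 3 first reached at time 2.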



ExactMaxCount(D, Q_1, Q_2, t^[, t^])

input: D is the database of points. The query is made up of a
       hyper-rectangle Q defined by points Q_1 and Q_2 and the time
       interval T = [t^[, t^]]
output: The exact MaxCount and time at which it occurs.

01. Times ← ∅ //of CrossTime objects                    O(1)
02. for each point p_i ∈ D                              O(N)
03.    if p_i ∈ Q during T                              O(1)
04.       EntryTime ← CalculateEntryTime(p_i, Q, T)     O(1)
05.       ExitTime ← CalculateExitTime(p_i, Q, T)       O(1)
06.       if EntryTime ∈ Times                          O(1)
07.          Times.Get(EntryTime).Count++               O(1)
08.       else
09.          Times.Add(new CrossTime(EntryTime))        O(1)
10.       end if
11.       if ExitTime ∈ Times                           O(1)
12.          Times.Get(ExitTime).Count--                O(1)
13.       else
14.          Times.Add(new CrossTime(ExitTime))         O(1)
15.       end if
16. end for
17. Sort(Times)                                         O(n log n)
18. Traverse(Times, time, MaxCount) //tracking time     O(N)
                                    //and MaxCount
19. return (time, MaxCount)                             O(1)

4 Threshold Operators



ThresholdRange(H, Q_1, Q_2, t^[, t^], M)

input: A set of buckets H built by the index structure presented,
       query points Q_1(t) and Q_2(t), a query time interval [t^[, t^]],
       and M is the threshold value
output: The estimated set of time intervals where R contains more
        than M points.

01 - 08 are the same as the MaxCount algorithm.
09. TimeIntervals ← ∅                                             O(1)
10. for each TimeInterval ∈ TimePartitionOrder                    O(B)
11.    CMaxCount ← Calculate(MaxCount, MaxTime, TimeInterval)     O(1)
12.    if CMaxCount > M                                           O(1)
13.       TimeIntervals ← TimeIntervals ∪ TimeInterval            O(1)
14.    end if
15. end for
16. Merge(TimeIntervals)                                          O(B)
17. return TimeIntervals



The ThresholdRange algorithm shown above and described in Definition [3] relates to MaxCount in
the way we calculate the aggregation. We maintain a running count to find time intervals that exceed the
threshold value M. If we set the threshold value near the MaxCount value (M → MaxCount), ThresholdRange
finds a small interval containing the MaxCount. We demonstrate this in the experimental
results, Section 5.

The ThresholdRange algorithm is the same as MaxCount up to Line 08, and then collects different 
information from each Timelnterval starting in Line 10. This leads to the following Theorem. 

Theorem 37 The estimated ThresholdRange query runs in O(B) time. 

Proof. The ThresholdRange algorithm differs from the MaxCount algorithm only in lines 09-17. Lines
11-14 run in O(1) time. Line 10 executes lines 11-13 O(B) times. In line 16, Merge(TimeIntervals) is a
linear walk of the time intervals that joins adjacent time intervals $T_a$ and $T_b$ when $T_a \cup T_b$ would form a
continuous time interval. The calculation is trivially O(1) time for joining the adjacent intervals. Hence, we
conclude by Theorem 36 that the ThresholdRange runs in O(B) time. ■
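Lines 09-17 amount to a filter followed by a linear merge. A minimal sketch, assuming the index time-intervals are already sorted and each carries its estimated maximum count (the tuple layout and function name are assumptions for illustration):

```python
def threshold_range(index_intervals, threshold):
    """index_intervals: sorted (lower, upper, estimated_max_count) triples.

    Keeps intervals whose count exceeds the threshold and joins neighbours
    that touch, in a single linear pass.
    """
    merged = []
    for lower, upper, count in index_intervals:
        if count <= threshold:
            continue
        if merged and merged[-1][1] == lower:
            merged[-1] = (merged[-1][0], upper)    # extend the previous interval
        else:
            merged.append((lower, upper))
    return merged


congested = threshold_range(
    [(0, 1, 2), (1, 2, 5), (2, 3, 6), (3, 4, 1), (4, 5, 7)], threshold=4
)
```

Here the intervals (1, 2) and (2, 3) are joined into (1, 3), while (4, 5) remains separate because the interval before it does not exceed the threshold.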

4.1 Threshold: Sum, Count and Average 

We give the following three operators based on ThresholdRange and conclude that none of the changes 
to the algorithm affect the running time of the ThresholdRange algorithm. 






ThresholdCount: 

By adding a line between 14 and 15 in the ThresholdRange algorithm that counts the merged time 
intervals, we can return the count of time intervals during the query time interval where congestion occurs. 
This count of time intervals gives a measure of variation in congestion. That is, if we have lots of time 
intervals, we expect that we have a large number of pockets of congestion. Since ThresholdCount does 
not give information relative to the entire time interval, it may need to be examined in light of the total 
time above the threshold. 

ThresholdSum: 

By summing the times instead of using the ∪ operator in line 13 of the ThresholdRange algorithm, we
can return the total congestion time during the query time interval. This total gives a measure of the severity
of congestion that may be compared to the length of query time.

ThresholdAverage:

By adding a line between lines 14 and 15 in the ThresholdRange algorithm that finds average length of 
the merged time intervals, we can return the average length of time each congestion will last. This average 
gives a different measure of the severity of each congestion. 
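Given the merged congestion intervals, the three operators reduce to simple aggregates. The sketch below assumes the merged intervals are available as (lower, upper) pairs; the function names mirror the operator names but are otherwise our own.

```python
def threshold_count(merged):
    """Number of separate congestion intervals."""
    return len(merged)


def threshold_sum(merged):
    """Total time spent above the threshold."""
    return sum(upper - lower for lower, upper in merged)


def threshold_average(merged):
    """Average length of a congestion interval (0 if there is none)."""
    return threshold_sum(merged) / len(merged) if merged else 0.0


merged = [(1, 3), (4, 5)]
summary = (threshold_count(merged), threshold_sum(merged), threshold_average(merged))
```

Each aggregate is a single pass over the merged list, so none of them changes the O(B) running time.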



4.2 Count Range Algorithm 

The CountRange algorithm is an adaptation of MaxCount in that it is the Count portion of the
MaxCount query. Using the equations for the cases described in Figure 8 we calculate the CountRange
as follows:

Figure 12: CountRange $Q_1$ at $t^{]}$ to $Q_2$ at $t^{[}$.

Figure 13: CountRange $Q_1$ at $t^{[}$ to $Q_2$ at $t^{]}$.

For each bucket we determine if the bucket is completely in or completely out of the query space. First
we find the beginning and ending time intervals. For each time interval, we get the associated function $\Delta p$
given in Equation (35) and its components. The components $p$ given in Equation (34) define the area
above a line through $Q_1$ and $Q_2$ at times $t^{[}$ and $t^{]}$. Figures 12 and 13 show these four lines. Figure 12 shows
the shaded area defined by:

$$\overleftarrow{\Delta p} = p_{Q_2, t^{[}} - p_{Q_1, t^{]}}. \qquad (66)$$

Figure 13 shows the shaded area:

$$\overrightarrow{\Delta p} = p_{Q_2, t^{]}} - p_{Q_1, t^{[}}. \qquad (67)$$



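If $\overleftarrow{\Delta p}$ or $\overrightarrow{\Delta p}$ for bucket $i$ is equal to the count of the bucket, then bucket $i$ is completely contained in the
query. If $\overleftarrow{\Delta p}$ and $\overrightarrow{\Delta p}$ for bucket $i$ are equal to 0, then bucket $i$ is not contained in the query. If neither
of these is true, we approximate the count for bucket $i$ as $\max(\overleftarrow{\Delta p}, \overrightarrow{\Delta p})$. That is, we calculate the
number of points in bucket $i$ that contribute to the CountRange as:

$$count_i = \begin{cases} b_i & \text{if } \overleftarrow{\Delta p} = b_i \lor \overrightarrow{\Delta p} = b_i \\ 0 & \text{if } \overleftarrow{\Delta p} = \overrightarrow{\Delta p} = 0 \\ \max(\overleftarrow{\Delta p}, \overrightarrow{\Delta p}) & \text{otherwise} \end{cases} \qquad (68)$$

This calculation requires that we keep the single dimension equations for $Q_1$ and $Q_2$ available and not discard
them after finding $\Delta p$ (see Equation (35)).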
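The piecewise rule for a single bucket can be written directly. In this sketch the two estimates are assumed to be already scaled to point counts, and a small tolerance replaces exact floating-point equality; both choices, like the names, are ours.

```python
def bucket_count(delta_left, delta_right, b_i, tol=1e-9):
    """Contribution of one bucket with b_i points, given its two estimates."""
    if abs(delta_left - b_i) <= tol or abs(delta_right - b_i) <= tol:
        return b_i                          # bucket completely inside the query
    if abs(delta_left) <= tol and abs(delta_right) <= tol:
        return 0                            # bucket completely outside the query
    return max(delta_left, delta_right)     # partial overlap


def count_range(buckets):
    """Sum the contributions of (delta_left, delta_right, b_i) triples."""
    return sum(bucket_count(*bucket) for bucket in buckets)


total = count_range([(10, 3, 10), (0, 0, 10), (2.5, 4.0, 10)])
```

The three sample buckets illustrate the three branches: fully contained, not contained, and partially contained.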

Hence, we have the following algorithm for CountRange: 

CountRange(H, Q_1, Q_2, t^[, t^])

input: A set of buckets H built by the index structure presented,
       query points Q_1(t) and Q_2(t) and a query time interval (t^[, t^]).
output: the estimated CountRange.

1. Count ← 0                                             O(1)
2. for each bucket B_i ∈ H                               O(B)
3.    Calculate(←Δp, →Δp) //using Equations (66) - (67)  O(1)
4.    Calculate(count_i) //using Equation (68)           O(1)
5.    Count ← Count + count_i                            O(1)
6. end for
7. return Count                                          O(1)



Theorem 38 The CountRange query runs in O(B) time.

Proof. Consider two different data structures for our buckets: HashTables and R-trees. In the case of
indexing using an R-tree, the worst case requires that we examine all buckets used in generating CountRange.
It is possible that this list could include all B buckets, giving a worst case of O(B). In the case of
using a HashTable, we must examine all B buckets. By Lemma 15, and because the equations used by the algorithm
are calculated in constant time, each bucket can be examined to determine the count that contributes to the
CountRange query in constant time. Therefore, the algorithm runs in O(B) time. ■

We note that CountRange is a simplification of the MaxCount operator in that we do not examine
every time interval. Further, we have a slightly different form of $\Delta p$ from Equation (35) to find the count.



5 Experimental Results 

We collected data from over 7500 queries that were selected from a set of randomly generated queries.
The selection process weeded out most similar queries and kept a set that represents narrow queries, wide
queries, near corner or edge queries, and queries outside the space contained in the database. Throughout
our experiments, we did not see significant accuracy fluctuation due to any of these types of queries.

Each experimental run consists of running all of the queries at several different decreasing bucket sizes
on a single data set. We made experimental runs against data sets ranging from 10,000 points to 1,500,000
points¹.

In the following experimental analysis, we measure the percentage error of the estimation algorithm 
relative to the exact-count algorithm as follows: 

$$Error_{Relative} = \frac{|\text{Exact Operator} - \text{Estimated Operator}|}{\text{Exact Operator}} \qquad (69)$$

Equation (|69|) provides a useful measure if the query returns a reasonable number of points. Queries that 
return a small number of points indicate that we should use the exact method. 

For ThresholdRange, we measure the percentage of intervals given by the accurate algorithm not
covered by the estimation algorithm using the operator UC for uncovered. That is, UC(a, b) returns the
sum of the lengths of intervals in a not covered by intervals in b. We divide the result by the accurate
ThresholdSum to determine the ThresholdRange error:

$$\frac{UC(\text{Ext. ThresholdRange}, \text{Est. ThresholdRange})}{\text{Ext. ThresholdSum}} \qquad (70)$$

¹ Threshold aggregation runs go only to 1 million points at which we already achieve acceptable error.

We also measure the percentage of intervals given by the estimate algorithm not covered by the exact
algorithm. We divide the result by the estimated ThresholdSum to determine the ThresholdRange
excess-error:

$$\text{excess-error} = \frac{UC(\text{Est. ThresholdRange}, \text{Ext. ThresholdRange})}{\text{Est. ThresholdSum}} \qquad (71)$$
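The error measures above can be computed with a short routine. This sketch assumes each ThresholdRange result is a list of disjoint (lower, upper) intervals sorted by lower bound; the function names are illustrative and the sample intervals are made up for the demonstration.

```python
def uncovered_length(a, b):
    """UC(a, b): total length of the intervals in a not covered by intervals in b."""
    total = 0.0
    for lo, hi in a:
        cursor = lo                          # everything before cursor is accounted for
        for b_lo, b_hi in b:
            if b_hi <= cursor:
                continue                     # interval of b lies entirely before cursor
            if b_lo >= hi:
                break                        # interval of b starts after this one ends
            if b_lo > cursor:
                total += b_lo - cursor       # gap not covered by b
            cursor = max(cursor, b_hi)
            if cursor >= hi:
                break
        if cursor < hi:
            total += hi - cursor             # uncovered tail
    return total


def interval_sum(intervals):
    return sum(hi - lo for lo, hi in intervals)


def relative_error(exact_value, estimated_value):
    return abs(exact_value - estimated_value) / exact_value


exact = [(0, 4), (6, 8)]
estimated = [(1, 3), (5, 7)]
range_error = uncovered_length(exact, estimated) / interval_sum(exact)
excess_error = uncovered_length(estimated, exact) / interval_sum(estimated)
```

In the sample, three of the six exact time units are missed by the estimate, and one of the four estimated units falls outside the exact result.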



We performed all the data runs on an Athlon 2000 with 1 GB of RAM. During each of the queries the
program does not contact the server tier and, thus, minimizes the impact of running a server on the same 
computer. The program pre-loads all data into data structures so that even the exact algorithms do not 
contact the server tier. 

5.1 Data Generation 




Figure 14: X-view, Y-view and Z-view of sample data.



Data for the experiments was randomly generated around several cluster centers. The i-th point generated
for the database is located near a randomly selected cluster at a distance between 0 and d, where d is
proportional to i. This method is similar to the Ziggurat (Marsaglia & Tsang 2000) method of generating
Gaussian (or normal) distributions used in the GSTD (Theodoridis et al 1999) and G-TERD (Tzouramanis et al
2002) spatiotemporal data generators (Nascimento et al 2003). However, our method does not generate
strictly Gaussian distributions since the distributions may stretch and compress along an axis. Our goal was
to generate a cluster that represents a source location and velocity that has most elements starting near a
center point, with the density decreasing as one moves toward the boundary of the cluster. This method models
source regions where the objects all head in about the same direction. A secondary goal was to make certain
that clusters were random in size and shape. The program is also capable of approximating a Zipf distribution
used in (Choi & Chung 2002; Revesz & Chen 2003; Tao, Sun & Papadias 2003). However, a single Zipf distribution
does not test the adaptability of our algorithm well; that is, our algorithm is capable of modeling a Zipf
distribution, and as such we could use a single bucket. Figure 14 shows a sample of a data set with points
projected onto the three views. The clusters look even more random because they can overlay one another.
When one looks at these, they nearly resemble the lights of a city from the air.
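The generation rule described above (the i-th point lies near a randomly chosen cluster at a distance between 0 and d, with d proportional to i) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `generate_points`, the `scale` factor, and the uniform choice of direction are hypothetical, and the per-axis stretching and velocity components mentioned in the text are left out.

```python
import math
import random


def generate_points(n_points, centers, scale, rng=None):
    """Generate points around cluster centers (illustrative only).

    centers: list of equal-length coordinate tuples (assumed cluster centers).
    scale:   assumed proportionality constant, so that d = scale * i.
    """
    rng = rng or random.Random()
    dims = len(centers[0])
    points = []
    for i in range(1, n_points + 1):
        center = rng.choice(centers)        # randomly selected cluster
        max_dist = scale * i                # d is proportional to i
        dist = rng.uniform(0.0, max_dist)   # distance between 0 and d
        # Random direction: normalize a vector of Gaussian samples.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        norm = math.sqrt(sum(c * c for c in direction)) or 1.0
        points.append(tuple(c + dist * u / norm for c, u in zip(center, direction)))
    return points
```

Because early points are confined to small distances and later points may land farther out, most of the mass stays near each center and thins out toward the cluster boundary, which is the qualitative shape the text describes.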

Along with a single Zipf distribution, we also note that a randomly generated uniform distribution is not
a good distribution to use for these types of experiments. Uniform distributions do not test the ability of the
algorithm to adapt. In fact, from earlier experiments in (Anderson 2006) we have found that using such a
distribution gives great (though meaningless) results. The problem reduces to a system capable of (and willing
to) model a uniform distribution finding a nearly perfect uniform distribution to model. Hence these results
are neither realistic nor meaningful.

5.2 Parameter Effects 

The index space ranges from 0 to 100 in each dimension. The number of points in the different data sets
ranges from 10,000 to 1,500,000. The following parameters were used in creating the index and finding the
MaxCount. 

Size of Buckets: The size of the buckets determines the number of possible buckets in the index. In the
experiments, buckets divide the space up such that there are 5 to 20 divisions in each dimension². These
divisions equate to bucket sizes ranging from 5 to 20 units wide in each dimension. Relative to our previous
work (Anderson 2006), this algorithm puts much more space into each bucket, creating bigger buckets.



Query Location: Locating the query near the lower or upper corners affects relative accuracy because the 
query returns very few points. Queries in this region are not interesting because they rarely involve many 
points and represent a query region that moves away from points in the database or barely moves at all. 
The small number of points returned indicates use of the exact algorithms. 



Query Types: In (Anderson 2006), we considered queries with several different characteristics: dense,
sparse, and Euclidean distance as it related to bucket size. By modeling the skew in buckets, we minimize 
the effect of these characteristics to the point that they did not impact the query error. Queries where the 
distance between the query points was small appeared to do as well as wider queries providing they returned 
a reasonable number of points. This result is a clear improvement over previous work that assumed uniform 
density within a bucket. 

Cluster Points: Index space saturation determines the number of buckets necessary for the index. The
number of cluster points does not appear to affect error as much as the space saturation. Further, we do not
consider a larger number of cluster points reasonable since the index space approaches a uniform distribution
as the number of cluster points increases. Gaps introduce difficult areas to model when they are not uniform.
And once again we reiterate, uniform distributions are not useful. In our experiments cluster points number
between 10 and 50.

2 Some MaxCount runs included up to 40 divisions, increasing accuracy, but not enough to warrant the extra running time.

Histogram Divisions: Increasing histogram divisions to s > 5 had no effect on the accuracy. This result
is not unexpected because histograms are used to define a trend function relative to trend functions on 
other axes. Increasing the histogram divisions has a tendency to flatten the lines. However, normalization 
flattens the trend function while maintaining the relationships between trends and hence this behavior is 
easily explained. Thus, increasing histogram divisions only increases the running time without increasing 
accuracy. 

Threshold Value: The threshold value determines the accuracy when set to low values compared to the 
number of points in the database. As expected, these extreme point values produce accurate estimations. 
High values also follow this trend. 

Time Endpoints: When dealing with either small time end points or small buckets, the method is susceptible
to rounding error. In particular, Equation (48) contains both t^6 and A terms. For very small values, on
the order of 1 × 10^-54 for 64-bit doubles, these calculations are extremely sensitive and care must be given to
guard against rounding error. Those errors showed in two ways: first, by a direct warning programmed into
the solution, and second, by a series of fairly stable time values for the MaxCount followed by unstable
variations when increasing the number of buckets. At some point, smaller bucket sizes increase the likelihood
of errors in both time and count values. Also, smaller buckets contain fewer points, which impacts the size of
the constants in Equation (48). Hence, as the bucket size becomes smaller in successive runs, the existence
of instability in the time values after a series of stable values predicts that an accurate MaxCount may 
be found in the previous larger bucket size. Throughout our experiments, this condition was an excellent 
predictor of an accurate MaxCount. 

The experiments demonstrated that 6-dimensional space compounds the problem when creating small
buckets. Creating an index with unit buckets would result in the possibility of having 1 × 10^12 buckets.
Clearly this number is unrealistic for common moving object applications where we may be dealing with
millions of objects. In practice the number of buckets needed to reach acceptable error levels was between
78,000 and 227,000 buckets. These numbers reflect the ability to reach error levels under 5% and were
roughly related to the saturation of the space by the points. It should be clear that a higher saturation of
the space by points would require a larger number of buckets. Figure 15 shows that we had a roughly linear
increase in the number of buckets for an exponential increase in the space. This pleasant surprise indicates
that for unsaturated data sets, the exponential explosion of space is manageable.

Figure 15: Ratio of the number of buckets in the index to the width of the space measured in buckets (axis
label: index space's width in buckets in each dimension).

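The bucket counts quoted above follow from simple arithmetic, which the short sketch below reproduces. The helper name `possible_buckets` is hypothetical; the space width of 100 units and the 6 dimensions are taken from the text.

```python
def possible_buckets(space_width, bucket_width, dims):
    """Number of grid cells when each dimension is cut into equal-width buckets."""
    divisions = space_width // bucket_width
    return divisions ** dims


unit = possible_buckets(100, 1, 6)      # unit buckets: 100 divisions per dimension
coarse = possible_buckets(100, 20, 6)   # 5 divisions per dimension
fine = possible_buckets(100, 5, 6)      # 20 divisions per dimension
```

Unit buckets give 100^6 = 10^12 possible cells, while 5 and 20 divisions per dimension give 15,625 and 64,000,000 possible cells respectively. The 78,000 to 227,000 buckets reported in the experiments are a small fraction of the finer grid, which is consistent with the observation that unsaturated data sets keep the index manageable.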
5.3 Running Time Observations 

Figure 16 shows the average ratio of the exact MaxCount running time to the estimated MaxCount
running time as a function of the number of points in the database. This result shows a nearly exponential 
growth when comparing the values between 10,000 and 1,000,000. The leveling off occurs because the 
number of points returned by the queries of 1 million points nearly equals the number of points returned by 
the queries of 1.5 million points. This result precisely matches our running-time analysis of the exact and 
estimation algorithms. 




Figure 16: Ratio of exact running time to estimated running time (axis label: points ×1000).

A natural question is when to use the exact versus the estimated methods. In runs with a small number of
points that need to be processed, the exact and estimation methods run about equally fast. However, when
the result size reaches values greater than 40,000 (our experiments returned sets as large as 331,491), the
estimation algorithms run up to 35 times faster than the exact algorithms. Further, we note that the error
is less predictable at smaller result sizes. Hence for small databases or in queries that return small result
sets, efficiency and accuracy both indicate using the exact method. However, for large data sets greater than
or equal to 1 million points, the estimation method greatly out-performs the exact method.
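The empirical rule in this paragraph can be written as a one-line dispatch. The sketch below is only an illustration: the names `EXACT_CUTOFF` and `choose_method` are hypothetical, and the cutoff of 40,000 returned points is the value observed in these experiments rather than a universal constant.

```python
EXACT_CUTOFF = 40_000  # result size above which the text reports estimation winning


def choose_method(expected_result_size, cutoff=EXACT_CUTOFF):
    """Pick the exact method for small result sets, the estimate for large ones."""
    return "estimate" if expected_result_size > cutoff else "exact"
```

In practice the expected result size would itself have to be guessed before running the query, for example from the number of points in the buckets that the query space touches; that step is not shown here.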

5.4 Operator Observations 

As expected, we noticed that each operator runs in about the same time as MaxCount. Only error values 
seemed to be different when studying different types of aggregation (e.g., when studying overlap error in 
ThresholdRange versus count error in MaxCount). Nevertheless, we have similarities between the
results. Almost all the figures in this section look like a view of mountains from a valley. That is what we 
expected to see and the lower and flatter the terrain the better. Buckets increase from back to front and 
point set sizes increase from left to right. 

5.5 MaxCount 

Figure 17 shows that increasing the number of buckets to the indicated values dramatically decreases the
MaxCount error. As the number of points increases we also see a decrease in the error. Note that for
larger buckets (e.g., smaller values on the "Buckets per Dimension" axis), the error decreases at a slightly
faster rate.

The exact MaxCount provided the values against which our estimation algorithm was tested for accu- 
racy. Since the method does not rely on buckets, and has zero error, we note only that on queries with small 
result sizes, this method performs as well, or better than the estimation algorithm. 
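To make the notion of an exact MaxCount concrete, the following sketch computes it by brute force for a simplified one-dimensional case: points move linearly, the query is an interval whose two endpoints also move linearly, and a sweep over entry and exit times finds the largest number of points inside the query at any instant. This is not the exact algorithm evaluated in this paper; the function names and the one-dimensional setting are assumptions chosen to keep the example short.

```python
def inside_interval(x, v, lo, lo_v, hi, hi_v, t0, t1):
    """Time span within [t0, t1] during which lo(t) <= x + v*t <= hi(t), or None."""
    start, end = t0, t1
    # Each constraint has the form a + b*t >= 0.
    for a, b in ((x - lo, v - lo_v), (hi - x, hi_v - v)):
        if b == 0:
            if a < 0:
                return None            # constraint never satisfied
        else:
            root = -a / b
            if b > 0:
                start = max(start, root)   # satisfied for t >= root
            else:
                end = min(end, root)       # satisfied for t <= root
    return (start, end) if start <= end else None


def max_count(points, query, t0, t1):
    """Maximum number of points simultaneously inside the moving query interval.

    points: list of (position, velocity) pairs at time 0.
    query:  (lo, lo_velocity, hi, hi_velocity) at time 0.
    """
    lo, lo_v, hi, hi_v = query
    events = []
    for x, v in points:
        span = inside_interval(x, v, lo, lo_v, hi, hi_v, t0, t1)
        if span is not None:
            events.append((span[0], 0))  # entry sorts before exit at equal times
            events.append((span[1], 1))
    events.sort()
    best = count = 0
    for _, kind in events:
        count += 1 if kind == 0 else -1
        best = max(best, count)
    return best
```

Every point is examined once and the events are sorted, so the cost grows with the number of points rather than with the number of buckets, which is why such a method is only attractive when the result set is small.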

5.6 ThresholdRange 

Figures 18 and 19 give the ThresholdRange error and ThresholdRange excess error, respectively, for
T = 10. ThresholdRange error gives the percentage of the exact intervals not covered by the estimation
value, and ThresholdRange excess error gives the percentage of the estimation not covering the exact.
These figures show that our method acts conservatively in covering more than is needed. However, at larger
point-set sizes, we still achieve under 5% error. Figure 18 shows 0% error caused by the point count staying
above 10% in data sets containing more than 30,000 points. Figure 19 shows that we covered at least
10% more time in the query time interval than needed until we reach larger point sets. Still, we showed
improvement with more buckets.

Figure 17: MaxCount error.

At T = 1000, we see 0% error until we reach point sets of 500,000 and greater. Figure 20 shows excellent
results with buckets above 10. Also, Figure 21 shows that the excess error drops to near 0% as well.

Figures 22 and 23 show what happens when we find an interval near the MaxCount value. The two
figures show the consequences of the estimation intervals being offset from the exact intervals by small 
amounts. The error decreases with more buckets. 

5.7 ThresholdCount 

This operator is the only operator that does not have relative error measurements. Instead we report the 
average number of intervals the estimation method differs from the exact method. As the figures show, we differ
by fewer than two intervals from the correct number.

Figure 24 shows the average error at T = 10 where the errors are small. Figure 25 (T = 1000) looks
much worse, but in reality we are still below 2 intervals off. We also note that the estimation may split or
combine an interval incorrectly when the intervals are very close together without greatly affecting the error 
of other operators. Given this possibility, the results are excellent. 

5.8 ThresholdSum 

ThresholdSum gives the total time above the threshold T. As one can see in Figure 26, at higher bucket
counts we have excellent error rates at T = 10. We didn't always expect great results at this threshold level
across all data sets, but ThresholdSum gives this result consistently all the way across.

We do note that when the threshold approaches MaxCount, we see extremely good accuracy as shown
in Figure 27.

Figure 18: ThresholdRange error.

Figure 19: ThresholdRange excess error.

5.9 ThresholdAverage

ThresholdAverage gives the average length of each time interval. Figure 28 shows the now familiar
mountains descending below 5% error at 20 buckets for T = 10. The figure also shows that even though
a few of the data sets tended to have good results at 5 and 10 buckets, these results are not guaranteed in
general. In Figure 29 the error reaches a plateau below 5% with only small bumps in the data.

Figure 20: ThresholdRange error, T=1000.

Figure 21: ThresholdRange excess error, T=1000.

Figure 22: ThresholdRange error, T=100000.

Figure 23: ThresholdRange excess error, T=100000.

Figure 24: ThresholdCount error, T=10.
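The relationship between the threshold operators of Sections 5.6 through 5.9 can be sketched on a piecewise-constant count function. The example below is an illustration only: the step-list representation, the function names, and the choice to treat a count at or above T as congested are assumptions, not the estimation algorithm evaluated here.

```python
def threshold_range(steps, t_end, threshold):
    """Intervals during which the count is at or above the threshold.

    steps: list of (time, count) pairs sorted by time; each count holds
           until the next step, and the last one holds until t_end.
    """
    intervals = []
    current_start = None
    for t, count in steps:
        if count >= threshold and current_start is None:
            current_start = t                      # congestion begins
        elif count < threshold and current_start is not None:
            intervals.append((current_start, t))   # congestion ends
            current_start = None
    if current_start is not None:
        intervals.append((current_start, t_end))
    return intervals


def threshold_aggregates(intervals):
    """Return (ThresholdCount, ThresholdSum, ThresholdAverage) for the intervals."""
    count = len(intervals)
    total = sum(end - start for start, end in intervals)
    average = total / count if count else 0.0
    return count, total, average
```

With steps [(0, 5), (2, 12), (5, 3), (7, 15)], an end time of 10 and T = 10, the range is [(2, 5), (7, 10)], giving a ThresholdCount of 2, a ThresholdSum of 6 and a ThresholdAverage of 3. The sketch also shows why a wrongly split or merged interval changes ThresholdCount and ThresholdAverage while leaving ThresholdSum almost untouched.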

5.10 CountRange 

Other CountRange algorithms have achieved error values between 2% and 3%. Using our method we 
conjecture that we could reduce the error because our method of approximation, although much more 
complicated, theoretically adapts to skewed distributions better than other methods. Figure 30 shows that
we achieved errors under 2% for 20 buckets across all the data sets, and in some cases, under 1%.

CountRange also runs at about the same speed as the threshold operators due to its similar implementation.

Additional information that contains error analyses of all the threshold values is given in Anderson (2007).

Figure 25: ThresholdCount error, T=100.

Figure 26: ThresholdSum error, T=10.



6 Related Work 



This section reviews the literature specific to aggregation. Spatial and spatiotemporal databases have
attracted an enormous amount of interest, and there exists a wide range of literature that is related to
our work only through indexing. For books on the subjects of spatiotemporal and constraint databases we
suggest: Rigaux et al (2001), Revesz (2002), Samet (1990, 2005), and Güting & Schneider (2005).



6.1 MaxCount and CountRange Aggregation 



There exist only a few previous algorithms to compute MaxCount (Revesz & Chen 2003; Anderson 2006;
Chen & Revesz 2004). None of those previous algorithms provides efficient queries without rebuilding the
index (i.e., they do not provide dynamic updates).

Previous approximate MaxCount solutions use indices from Acharya et al (1999) that minimize the
skew of point distributions in the buckets by creating hyper-buckets based on the properties of all points
at index creation time. Updates require the index to be rebuilt because the buckets depend on the point
distribution at a specific time. In contrast, the probabilistic method we presented recognizes point density
skew in each bucket instead and creates a density distribution to model it. We present the first efficient and
dynamic algorithm for MaxCount. Table 1 compares the results of earlier MaxCount algorithms with
our current algorithm where N is the number of points and B is the number of buckets in the index.

Figure 28: ThresholdAverage error, T=10.

To our knowledge, we present the first proposal of these threshold aggregate operators for moving points: 
MaxCount (and MinCount), ThresholdRange, ThresholdCount, ThresholdSum, and Thresh- 
oldAverage. 

We can modify Spatiotemporal-Range algorithms to return the CountRange by counting the objects
returned. Several other algorithms were proposed directly for the CountRange problem. We summarize
previous Spatiotemporal-Range and CountRange algorithms in Table 2 where N is the number
of moving objects or points in the database, d is the dimension of the space, and B is the number of buckets.
All algorithms listed are dynamic, which means that they allow insertions and deletions of moving objects
without rebuilding the index.

Figure 29: ThresholdAverage error, T=1000.

Figure 30: CountRange error.

In all our work we consider time as a continuous variable. Time as a discrete variable is discussed in both
temporal and spatiotemporal aggregation by Agarwal et al (2003), Tao & Papadias (2005) and Böhlen et al
(2006). In the discrete approach, time stamps describe the temporal nature of objects. This approach is less
relevant to our work, but is relevant to many applications.

1 This is a restricted future time query with expected O(N) space that becomes quadratic if the restriction is too far into 
the future. 

2 C = K + K', where K' is the approximation error.

3 Although Tao, Sun & Papadias (2003) allow dynamic updates, over time the index must be rebuilt.

Table 1: MaxCount aggregation complexity on linearly moving objects.

Max. Dim.   Worst Case Time   Space     Exact or Est.   Static or Dynamic   Reference
1           O(log N)          O(N^2)    Exact           Static              Revesz & Chen (2003)
1           O(B log B)        O(B)      Est.            Static              Chen & Revesz (2004)
2           O(B log B)        O(B)      Est.            Static              Anderson (2006)
d           O(B)              O(B)      Est.            Dynamic             Anderson (2007)
d           O(N)              O(1)      Exact           Dynamic             Anderson (2007)



Table 2: Range and CountRange aggregation summary.

Max. Dim.   Worst Case Time          Worst Case Space   Exact or Est.   Reference
2           O(N^(1/2+ε) + k)         O(N)               Exact           Kollios et al. (1999)
2           O(log_2 N + k)           O(N^2)             Exact           Kollios et al. (1999)
2           O(N)                     O(N)               Exact           Papadopoulos et al. (2002)
3           O(N)                     O(N)               Exact           Saltenis et al. (2000)
d           O(N)                     O(N)               Exact           Porkaew et al. (2001)
d           O(B^(d-1) log_B N)       O(N log_B N)       Exact           Zhang et al. (2003)
2           O(log_B N + C)/B         O(N)               Est.            Kollios et al. (1999)¹
2           O(B)                     O(B)               Est.            Choi & Chung (2002)
d           O(B)                     O(B)               Est.            Tao et al. (2003)
d           O(√N)                    O(N)               Est.            Tao & Papadias (2005)
d           O(B)                     O(B)               Est.            Anderson (2007)



6.2 Indices and Estimation Techniques 

There are many ways our work is indirectly related to previous work on indexing structures and estimation 
techniques. Count and Max aggregation operators have only a titular relationship to the MaxCount 
aggregation, because one cannot use the Count and Max aggregation operators to implement the Max- 
Count aggregation. Nevertheless, several techniques used in the MaxCount problem are also used in 
other indices and algorithms designed for range, max/min, and count queries. We summarize several of 
these related techniques next. 



6.2.1 Indices 



The index structure of Agarwal et al (2003) finds the 2-dimensional moving points contained in a rectangle in
O(√N) time. Gunopulos et al (2005) gave a selectivity estimation with a histogram structure of overlapping
buckets designed to approximate the density of multi-dimensional data. The algorithm runs in constant time
O(d|B|), where d is the number of dimensions and B is the number of buckets. Gupta et al (2004) gave a
technique for answering spatiotemporal range, intercept, incidence, and shortest path queries on objects that
move along curves in a planar graph.

Civilis et al (2004, 2005) also gave indexing methods that use networks, such as roads, to predict position and
motion changes of objects that follow roads and characteristics of routes. Pfoser & Jensen (2005) used
networks to reduce the dimensionality of constrained moving objects to two dimensional trajectories and
examined the method in terms of the spatiotemporal range query. de Almeida & Güting (2005) proposed the
MON-Tree to index moving objects in networks using graphs or route oriented networks to find the
spatiotemporal range and window queries. They define window queries as returning the pieces of the object's
movement that intersect the query window.

Zhang et al (2001) proposed the multiversion SB-tree to perform range temporal aggregates: Sum, Count and
Avg in O(log_b n), where b is the number of records per block and n is the number of entries in the database.

Revesz (2005) gave efficient rectangle indexing algorithms based on point dominance to find count interpreted
in k dimensions using the following concepts:

1. stabbing gives the number of objects that contain a point; 

2. contain gives the number of rectangles that contain the query rectangle; 

3. overlap gives the number of rectangles that overlap the query rectangle; and 

4. within gives the number of rectangles within the query space. 

These four operators have a running time of O(log^k n) where k is the number of dimensions and n is the
number of points.
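The four count operators listed above can be pinned down with a brute-force sketch over axis-aligned rectangles in k dimensions. The code is only meant to fix the definitions: it scans every rectangle and does not use the point-dominance index that gives the logarithmic running time, and the representation of a rectangle as a pair of corner tuples is an assumption of this example.

```python
def contains_point(rect, point):
    """True if the point lies inside the rectangle (boundaries included)."""
    lo, hi = rect
    return all(l <= p <= h for l, p, h in zip(lo, point, hi))


def contains_rect(outer, inner):
    """True if `inner` lies entirely inside `outer`."""
    return all(ol <= il and ih <= oh
               for ol, il, ih, oh in zip(outer[0], inner[0], inner[1], outer[1]))


def overlaps(a, b):
    """True if the two rectangles share at least one point."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def stabbing(rects, point):
    return sum(contains_point(r, point) for r in rects)


def contain(rects, query):
    return sum(contains_rect(r, query) for r in rects)


def overlap(rects, query):
    return sum(overlaps(r, query) for r in rects)


def within(rects, query):
    return sum(contains_rect(query, r) for r in rects)
```

Each function costs O(kn) per query in this form, which is the baseline that an index with polylogarithmic query time improves on.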



Saltenis et al (2000) gave an R*-tree based indexing technique for 1, 2, and 3 dimensional moving objects
that provides time-slice queries (selection queries), window queries, and moving queries. Window queries
return the same information as range queries, but with a valid time window starting at the current time
and continuing to t_h. Window queries may request predictions for range queries within this window of time.
Moving queries, similar to incidence queries, return the points that are contained within the space connecting
one rectangle at a start time to a second rectangle at an end time. The proposed time parameterized R-tree
(TPR-Tree) search runs in expected logarithmic time. Another R*-tree extension given by Cai & Revesz (2000)
forms tighter parametric bounding boxes than Saltenis et al (2000) and has similar running time.
Tao, Papadias & Sun (2003) proposed the TPR*-Tree that extends the TPR-Tree with improved insert and
delete algorithms. In the context of a variety of count queries it performs similarly to previous indices.
et al (2004) uses time-dependent, updatable histograms to query counts at specific times including
past, present and future. Recently, Pelanis et al (2006) proposed the R^PPF-tree that indexes past, present
and predictive positions of moving points, and extends the previous work on TPR-Trees (Saltenis et al 2000)
with a partial persistence framework. Earlier work by Tayeb et al (1998) adapted the PMR-quadtree
(Samet 1990), a variant of the quadtree structure, for indexing moving objects to answer time-slice queries, which
they called instantaneous queries, and infinitely repeated time-slice queries, called continuous queries. Search
performance is similar to quadtrees and allows searches in O(log N) time.

Mokhtar et al (2002) use the sweeping technique from computational geometry to define a query language
to evaluate past, present, and future positions of moving objects in constraint databases.

Finally, Hadjieleftheriou et al (2003) use an efficient approximation method to find areas where the
density of objects is above a specific threshold during a specific time interval. This method comes the closest 
to the method used in our aggregation operators, but does not allow for the query to move or change shape 
over time. In fact, this method is not applied to counting at all. 

Note that each of these indexing methods that return the moving points in a query window or rectangle 
can be easily modified to return instead the count of the number of moving points. However, they may not 
be easily extended to provide a MaxCount within a changing, moving query space. 

With a few exceptions you can see that Count aggregation is O(log N + d) for exact methods and O(B)
or better for estimation methods. The hidden constant in the exact method is the number of buckets that 
must be traversed to find the Count. Estimation methods vary in many ways and asymptotic running time 
doesn't always give a meaningful estimate as to how big B will be. 



6.2.2 Estimation Techniques 

Our work is related to several other papers that estimate the count aggregate operation on spatiotemporal 
databases. 



Acharya et al (1999) gave an algorithm that can estimate the Count of the number of the rectangles
that intersect a query rectangle for Selectivity Estimation. Choi & Chung (2002) and Tao, Sun & Papadias
(2003) proposed methods that can estimate the Count of the moving points in the plane that intersect a
query rectangle. More recently, Kollios et al (2005) gave a predictive method based on dual transformations.
Wolfson & Yin (2003) and Trajcevski et al (2004) gave a method for generating pseudo trajectories of
moving objects. Most of these estimation algorithms use buckets as basic building structures of the index.
In extending this idea, we use 2d-dimensional hyper-buckets in our algorithms where d is the number of
dimensions in the moving-objects space.
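One way to picture a 2d-dimensional hyper-bucket index is sketched below. It assumes, for illustration only, that the 2d coordinates of a linearly moving point are its d position components followed by its d velocity components, and that buckets form a regular grid; the names `hyper_bucket` and `build_index` and the sparse `Counter` representation are hypothetical and do not reproduce the index structure of this paper.

```python
from collections import Counter


def hyper_bucket(position, velocity, lower, width):
    """Grid cell of a moving point in the assumed 2d-dimensional space.

    lower, width: per-dimension lower bound and bucket width (length 2d).
    """
    coords = tuple(position) + tuple(velocity)
    return tuple(int((c - lo) // w) for c, lo, w in zip(coords, lower, width))


def build_index(points, lower, width):
    """Count points per occupied hyper-bucket; empty buckets take no space."""
    index = Counter()
    for position, velocity in points:
        index[hyper_bucket(position, velocity, lower, width)] += 1
    return index
```

Storing only occupied cells is what keeps the number of buckets far below the full grid size when the data is clustered, and an insertion or deletion touches a single counter.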



7 Conclusions and Future Work 

We implemented and compared two new MaxCount algorithms. The estimated MaxCount was shown to 
be fast and accurate while still allowing fast constant time updates. No other algorithm has these features to 
date. We showed that ThresholdRange, ThresholdCount, ThresholdSum, ThresholdAverage,
and CountRange are related to MaxCount and can be evaluated using similar techniques and that we
achieve error values under 5% in these operations. We gave an empirical threshold for choosing between 
the exact and estimated algorithms. We discussed the issues related to higher dimensions and note that all 
sweeping algorithms have this problem. We also note that using our technique it is possible to decompose 
the problem and run it in a multiprocessor or grid environment where the database is divided into smaller 
databases. 

Future work may include decreasing the running time by finding other techniques because there does not 
appear to be a clear method for decreasing the running time of sweeping methods. One could also consider 
implementing and comparing these techniques in a grid computing environment.